Compare commits

1 Commit

Author | SHA1 | Message | Date
Claudio Atzori | a11ad946fb | bumped pom | 2023-12-25 11:02:35 +01:00
416 changed files with 3654 additions and 16154 deletions

View File

@ -1,43 +0,0 @@
# Contributor Code of Conduct
Openness, transparency and our community-driven participatory approach guide us in our day-to-day interactions and decision-making. Our open source projects are no exception. Trust, respect, collaboration and transparency are core values we believe should live and breathe within our projects. Our community welcomes participants from around the world with different experiences, unique perspectives, and great ideas to share.
## Our Pledge
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment include:
- Using welcoming and inclusive language
- Being respectful of differing viewpoints and experiences
- Gracefully accepting constructive criticism
- Attempting collaboration before conflict
- Focusing on what is best for the community
- Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
- Violence, threats of violence, or inciting others to commit self-harm
- The use of sexualized language or imagery and unwelcome sexual attention or advances
- Trolling, intentionally spreading misinformation, insulting/derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or electronic address, without explicit permission
- Abuse of the reporting process to intentionally harass or exclude others
- Advocating for, or encouraging, any of the above behavior
- Other conduct which could reasonably be considered inappropriate in a professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/), [version 1.4](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html).

View File

@ -1,10 +0,0 @@
# Contributing to D-Net Hadoop
:+1::tada: First off, thanks for taking the time to contribute! :tada::+1:
This project and everyone participating in it is governed by our [Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
The following is a set of guidelines for contributing to this project and its packages. These are mostly guidelines, not rules, and they apply to this project as a whole, including all its sub-modules.
Use your best judgment, and feel free to propose changes to this document in a pull request.
All contributions are welcome and will be considered to be contributed under the [project license](LICENSE.md).

View File

View File

@ -2,11 +2,6 @@
Dnet-hadoop is the project that defined all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing, and provisioning.
This project adheres to the Contributor Covenant [code of conduct](CODE_OF_CONDUCT.md).
By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
This project is licensed under the [AGPL v3 or later version](#LICENSE.md).
How to build, package and run oozie workflows
====================

View File

@ -1,39 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
public class CategorySummary {
private String id;
private String label;
private boolean hasConcept;
public String getId() {
return id;
}
public String getLabel() {
return label;
}
public boolean isHasConcept() {
return hasConcept;
}
public CategorySummary setId(final String id) {
this.id = id;
return this;
}
public CategorySummary setLabel(final String label) {
this.label = label;
return this;
}
public CategorySummary setHasConcept(final boolean hasConcept) {
this.hasConcept = hasConcept;
return this;
}
}
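CategorySummary and the other context-API beans in this package expose fluent setters. A minimal, hypothetical usage sketch (the id and label values are made up; it assumes the classes shown in this diff are on the classpath):
public class CategorySummaryExample {
    public static void main(String[] args) {
        // chain the fluent setters to build a summary; values are illustrative only
        CategorySummary category = new CategorySummary()
            .setId("category-1")
            .setLabel("Example category")
            .setHasConcept(true);

        // the *List types are plain ArrayList subclasses
        CategorySummaryList list = new CategorySummaryList();
        list.add(category);

        System.out.println(list.get(0).getId() + " -> " + list.get(0).getLabel());
    }
}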

View File

@ -1,7 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class CategorySummaryList extends ArrayList<CategorySummary> {
}

View File

@ -1,52 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.List;
public class ConceptSummary {
private String id;
private String label;
public boolean hasSubConcept;
private List<ConceptSummary> concepts;
public String getId() {
return id;
}
public String getLabel() {
return label;
}
public List<ConceptSummary> getConcepts() {
return concepts;
}
public ConceptSummary setId(final String id) {
this.id = id;
return this;
}
public ConceptSummary setLabel(final String label) {
this.label = label;
return this;
}
public boolean isHasSubConcept() {
return hasSubConcept;
}
public ConceptSummary setHasSubConcept(final boolean hasSubConcept) {
this.hasSubConcept = hasSubConcept;
return this;
}
public ConceptSummary setConcept(final List<ConceptSummary> concepts) {
this.concepts = concepts;
return this;
}
}

View File

@ -1,7 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class ConceptSummaryList extends ArrayList<ConceptSummary> {
}

View File

@ -1,50 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
public class ContextSummary {
private String id;
private String label;
private String type;
private String status;
public String getId() {
return id;
}
public String getLabel() {
return label;
}
public String getType() {
return type;
}
public String getStatus() {
return status;
}
public ContextSummary setId(final String id) {
this.id = id;
return this;
}
public ContextSummary setLabel(final String label) {
this.label = label;
return this;
}
public ContextSummary setType(final String type) {
this.type = type;
return this;
}
public ContextSummary setStatus(final String status) {
this.status = status;
return this;
}
}

View File

@ -1,7 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class ContextSummaryList extends ArrayList<ContextSummary> {
}

View File

@ -8,13 +8,10 @@ import java.io.InputStream;
import java.net.*;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.math.NumberUtils;
import org.apache.commons.lang3.time.DateUtils;
import org.apache.http.HttpHeaders;
import org.joda.time.Instant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -97,16 +94,14 @@ public class HttpConnector2 {
throw new CollectorException(msg);
}
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
InputStream input = null;
long start = System.currentTimeMillis();
try {
if (getClientParams().getRequestDelay() > 0) {
backoffAndSleep(getClientParams().getRequestDelay());
}
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
final HttpURLConnection urlConn = (HttpURLConnection) new URL(requestUrl).openConnection();
urlConn.setInstanceFollowRedirects(false);
urlConn.setReadTimeout(getClientParams().getReadTimeOut() * 1000);
@ -120,8 +115,9 @@ public class HttpConnector2 {
urlConn.addRequestProperty(headerEntry.getKey(), headerEntry.getValue());
}
}
logHeaderFields(urlConn);
if (log.isDebugEnabled()) {
logHeaderFields(urlConn);
}
int retryAfter = obtainRetryAfter(urlConn.getHeaderFields());
String rateLimit = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT);
@ -136,7 +132,9 @@ public class HttpConnector2 {
}
if (is2xx(urlConn.getResponseCode())) {
return getInputStream(urlConn, start);
input = urlConn.getInputStream();
responseType = urlConn.getContentType();
return input;
}
if (is3xx(urlConn.getResponseCode())) {
// REDIRECTS
@ -146,7 +144,6 @@ public class HttpConnector2 {
.put(
REPORT_PREFIX + urlConn.getResponseCode(),
String.format("Moved to: %s", newUrl));
logRequestTime(start);
urlConn.disconnect();
if (retryAfter > 0) {
backoffAndSleep(retryAfter);
@ -162,50 +159,26 @@ public class HttpConnector2 {
if (retryAfter > 0) {
log
.warn(
"waiting and repeating request after suggested retry-after {} sec for URL {}",
retryAfter, requestUrl);
"{} - waiting and repeating request after suggested retry-after {} sec.",
requestUrl, retryAfter);
backoffAndSleep(retryAfter * 1000);
} else {
log
.warn(
"waiting and repeating request after default delay of {} sec for URL {}",
getClientParams().getRetryDelay(), requestUrl);
backoffAndSleep(retryNumber * getClientParams().getRetryDelay());
"{} - waiting and repeating request after default delay of {} sec.",
requestUrl, getClientParams().getRetryDelay());
backoffAndSleep(retryNumber * getClientParams().getRetryDelay() * 1000);
}
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
logRequestTime(start);
urlConn.disconnect();
return attemptDownload(requestUrl, retryNumber + 1, report);
case 422: // UNPROCESSABLE ENTITY
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
log.warn("waiting and repeating request after 10 sec for URL {}", requestUrl);
backoffAndSleep(10000);
urlConn.disconnect();
logRequestTime(start);
try {
return getInputStream(urlConn, start);
} catch (IOException e) {
log
.error(
"server returned 422 and got IOException accessing the response body from URL {}",
requestUrl);
log.error("IOException:", e);
return attemptDownload(requestUrl, retryNumber + 1, report);
}
default:
log.error("gor error {} from URL: {}", urlConn.getResponseCode(), urlConn.getURL());
log.error("response message: {}", urlConn.getResponseMessage());
report
.put(
REPORT_PREFIX + urlConn.getResponseCode(),
String
.format(
"%s Error: %s", requestUrl, urlConn.getResponseMessage()));
logRequestTime(start);
urlConn.disconnect();
throw new CollectorException(urlConn.getResponseCode() + " error " + report);
}
}
@ -226,27 +199,13 @@ public class HttpConnector2 {
}
}
private InputStream getInputStream(HttpURLConnection urlConn, long start) throws IOException {
InputStream input = urlConn.getInputStream();
responseType = urlConn.getContentType();
logRequestTime(start);
return input;
}
private static void logRequestTime(long start) {
log
.info(
"request time elapsed: {}sec",
TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis() - start));
}
private void logHeaderFields(final HttpURLConnection urlConn) throws IOException {
log.info("Response: {} - {}", urlConn.getResponseCode(), urlConn.getResponseMessage());
log.debug("StatusCode: {}", urlConn.getResponseMessage());
for (Map.Entry<String, List<String>> e : urlConn.getHeaderFields().entrySet()) {
if (e.getKey() != null) {
for (String v : e.getValue()) {
log.info(" key: {} - value: {}", e.getKey(), v);
log.debug(" key: {} - value: {}", e.getKey(), v);
}
}
}
@ -266,7 +225,7 @@ public class HttpConnector2 {
for (String key : headerMap.keySet()) {
if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (!headerMap.get(key).isEmpty())
&& NumberUtils.isCreatable(headerMap.get(key).get(0))) {
return Integer.parseInt(headerMap.get(key).get(0));
return Integer.parseInt(headerMap.get(key).get(0)) + 10;
}
}
return -1;
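The HttpConnector2 hunks above rework the retry handling, the Retry-After parsing, and the request-time logging. A simplified, standalone sketch of the retry/backoff pattern shown here (this is not the project's HttpConnector2 API; class and method names, the padding of the delay, and the sample URL are illustrative):
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RetryAfterSketch {

    // parse Retry-After if present and numeric, otherwise signal "not found" with -1
    static int obtainRetryAfter(HttpURLConnection conn) {
        String value = conn.getHeaderField("Retry-After");
        try {
            // pad the suggested delay a little, mirroring the "+ 10" seen in the diff above
            return value != null ? Integer.parseInt(value.trim()) + 10 : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    static InputStream fetchWithRetry(String url, int attempt, int maxAttempts) throws Exception {
        if (attempt > maxAttempts) {
            throw new IllegalStateException("max attempts exceeded for " + url);
        }
        long start = System.currentTimeMillis();
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setInstanceFollowRedirects(false);
        int code = conn.getResponseCode();
        if (code >= 200 && code < 300) {
            // log the elapsed time, as logRequestTime does in the diff above
            System.out.println("request time elapsed: " + (System.currentTimeMillis() - start) / 1000 + "sec");
            return conn.getInputStream();
        }
        if (code == 429 || code == 503) {
            // honour the server's suggested delay when present, otherwise back off progressively
            int retryAfter = obtainRetryAfter(conn);
            long delayMs = retryAfter > 0 ? retryAfter * 1000L : attempt * 10_000L;
            conn.disconnect();
            Thread.sleep(delayMs);
            return fetchWithRetry(url, attempt + 1, maxAttempts);
        }
        conn.disconnect();
        throw new IllegalStateException("got error " + code + " from URL: " + url);
    }

    public static void main(String[] args) throws Exception {
        // illustrative endpoint only
        try (InputStream in = fetchWithRetry("https://example.org/records", 1, 5)) {
            System.out.println("fetched, " + in.available() + " bytes buffered");
        }
    }
}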

View File

@ -63,10 +63,7 @@ public class Vocabulary implements Serializable {
}
public VocabularyTerm getTermBySynonym(final String syn) {
return Optional
.ofNullable(syn)
.map(s -> getTerm(synonyms.get(s.toLowerCase())))
.orElse(null);
return getTerm(synonyms.get(syn.toLowerCase()));
}
public Qualifier getTermAsQualifier(final String termId) {
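One version of getTermBySynonym shown above guards the lookup with Optional.ofNullable, while the other calls syn.toLowerCase() directly and will throw a NullPointerException on null input. A standalone illustration of the two behaviours (class name and map contents are illustrative):
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class SynonymLookupSketch {

    static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("journal article", "0001"); // illustrative entry
    }

    // Optional-guarded behaviour: null-safe, returns null for a null synonym
    static String lookupNullSafe(String syn) {
        return Optional
            .ofNullable(syn)
            .map(s -> SYNONYMS.get(s.toLowerCase()))
            .orElse(null);
    }

    // direct behaviour: syn.toLowerCase() throws NullPointerException when syn is null
    static String lookupDirect(String syn) {
        return SYNONYMS.get(syn.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(lookupNullSafe(null));              // null
        System.out.println(lookupNullSafe("Journal Article")); // 0001
        // lookupDirect(null);                                 // would throw NullPointerException
    }
}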

View File

@ -135,24 +135,6 @@ public class VocabularyGroup implements Serializable {
return vocs.get(vocId.toLowerCase()).getSynonymAsQualifier(syn);
}
public Qualifier lookupTermBySynonym(final String vocId, final String syn) {
return find(vocId)
.map(
vocabulary -> Optional
.ofNullable(vocabulary.getTerm(syn))
.map(
term -> OafMapperUtils
.qualifier(term.getId(), term.getName(), vocabulary.getId(), vocabulary.getName()))
.orElse(
Optional
.ofNullable(vocabulary.getTermBySynonym(syn))
.map(
term -> OafMapperUtils
.qualifier(term.getId(), term.getName(), vocabulary.getId(), vocabulary.getName()))
.orElse(null)))
.orElse(null);
}
/**
* getSynonymAsQualifierCaseSensitive
*

View File

@ -119,131 +119,6 @@ public class AuthorMerger {
});
}
public static String normalizeFullName(final String fullname) {
return nfd(fullname)
.toLowerCase()
// do not compact the regexes in a single expression, would cause StackOverflowError
// in case
// of large input strings
.replaceAll("(\\W)+", " ")
.replaceAll("(\\p{InCombiningDiacriticalMarks})+", " ")
.replaceAll("(\\p{Punct})+", " ")
.replaceAll("(\\d)+", " ")
.replaceAll("(\\n)+", " ")
.trim();
}
private static String authorFieldToBeCompared(Author author) {
if (StringUtils.isNotBlank(author.getSurname())) {
return author.getSurname();
}
if (StringUtils.isNotBlank(author.getFullname())) {
return author.getFullname();
}
return null;
}
/**
* This method tries to figure out when two authors are the same in the context
* of ORCID enrichment
*
* @param left Author in the OAF entity
* @param right Author ORCID
* @return whether the two authors are considered the same, based on a heuristic on their names.
*/
public static boolean checkORCIDSimilarity(final Author left, final Author right) {
final Person pl = parse(left);
final Person pr = parse(right);
// If one of them does not have a surname, we verify that both have a non-empty fullname
// and check whether the normalized versions are equal
if (!(pl.getSurname() != null && pl.getSurname().stream().anyMatch(StringUtils::isNotBlank) &&
pr.getSurname() != null && pr.getSurname().stream().anyMatch(StringUtils::isNotBlank))) {
if (pl.getFullname() != null && !pl.getFullname().isEmpty() && pr.getFullname() != null
&& !pr.getFullname().isEmpty()) {
return pl
.getFullname()
.stream()
.anyMatch(
fl -> pr.getFullname().stream().anyMatch(fr -> normalize(fl).equalsIgnoreCase(normalize(fr))));
} else {
return false;
}
}
// The Authors have one surname in common
if (pl.getSurname().stream().anyMatch(sl -> pr.getSurname().stream().anyMatch(sr -> sr.equalsIgnoreCase(sl)))) {
// If one of them has only a surname and it matches, we can say that they are the same author
if ((pl.getName() == null || pl.getName().stream().allMatch(StringUtils::isBlank)) ||
(pr.getName() == null || pr.getName().stream().allMatch(StringUtils::isBlank)))
return true;
// The authors have the same initials of Name in common
if (pl
.getName()
.stream()
.anyMatch(
nl -> pr
.getName()
.stream()
.anyMatch(nr -> nr.equalsIgnoreCase(nl))))
return true;
}
// Sometimes publications list authors in inverted order (Surname, Name);
// we verify whether there is an exact match between name and surname when swapped
if (pl.getSurname().stream().anyMatch(sl -> pr.getName().stream().anyMatch(nr -> nr.equalsIgnoreCase(sl))) &&
pl.getName().stream().anyMatch(nl -> pr.getSurname().stream().anyMatch(sr -> sr.equalsIgnoreCase(nl))))
return true;
else
return false;
}
//
/**
* Method to enrich ORCID information in one list of authors based on another list
*
* @param baseAuthor the Author List in the OAF Entity
* @param orcidAuthor The list of ORCID Author intersected
* @return the author list of the OAF entity enriched with the ORCID authors
*/
public static List<Author> enrichOrcid(List<Author> baseAuthor, List<Author> orcidAuthor) {
if (baseAuthor == null || baseAuthor.isEmpty())
return orcidAuthor;
if (orcidAuthor == null || orcidAuthor.isEmpty())
return baseAuthor;
if (baseAuthor.size() == 1 && orcidAuthor.size() > 10)
return baseAuthor;
final List<Author> oAuthor = new ArrayList<>();
oAuthor.addAll(orcidAuthor);
baseAuthor.forEach(ba -> {
Optional<Author> aMatch = oAuthor.stream().filter(oa -> checkORCIDSimilarity(ba, oa)).findFirst();
if (aMatch.isPresent()) {
final Author sameAuthor = aMatch.get();
addPid(ba, sameAuthor.getPid());
oAuthor.remove(sameAuthor);
}
});
return baseAuthor;
}
private static void addPid(final Author a, final List<StructuredProperty> pids) {
if (a.getPid() == null) {
a.setPid(new ArrayList<>());
}
a.getPid().addAll(pids);
}
public static String pidToComparableString(StructuredProperty pid) {
final String classid = pid.getQualifier().getClassid() != null ? pid.getQualifier().getClassid().toLowerCase()
: "";
@ -296,7 +171,7 @@ public class AuthorMerger {
}
}
public static String normalize(final String s) {
private static String normalize(final String s) {
String[] normalized = nfd(s)
.toLowerCase()
// do not compact the regexes in a single expression, would cause StackOverflowError
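The AuthorMerger hunk above touches the ORCID-enrichment helpers (normalizeFullName, checkORCIDSimilarity, enrichOrcid) and the visibility of normalize. A standalone sketch of the name-normalization step those helpers rely on (class name and sample values are illustrative, not part of the project):
import java.text.Normalizer;

public class NameNormalizationSketch {

    // mirrors the normalization chain shown above: NFD decomposition, lowercase,
    // then stepwise removal of non-word chars, diacritics, punctuation, digits and newlines
    static String normalizeFullName(String fullname) {
        return Normalizer.normalize(fullname, Normalizer.Form.NFD)
            .toLowerCase()
            // kept as separate replaceAll calls, as the original comment warns about StackOverflowError
            .replaceAll("(\\W)+", " ")
            .replaceAll("(\\p{InCombiningDiacriticalMarks})+", " ")
            .replaceAll("(\\p{Punct})+", " ")
            .replaceAll("(\\d)+", " ")
            .replaceAll("(\\n)+", " ")
            .trim();
    }

    public static void main(String[] args) {
        // two spellings of the same author normalize to comparable strings
        System.out.println(normalizeFullName("Anand, Rachna")); // "anand rachna"
        System.out.println(normalizeFullName("Anand  Rachna")); // "anand rachna"
    }
}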

View File

@ -21,15 +21,10 @@ import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup;
import eu.dnetlib.dhp.schema.common.EntityType;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.oaf.utils.GraphCleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
import scala.Tuple2;
/**
@ -40,12 +35,6 @@ public class GroupEntitiesSparkJob {
private static final Encoder<OafEntity> OAFENTITY_KRYO_ENC = Encoders.kryo(OafEntity.class);
private ArgumentApplicationParser parser;
public GroupEntitiesSparkJob(ArgumentApplicationParser parser) {
this.parser = parser;
}
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
@ -62,17 +51,6 @@ public class GroupEntitiesSparkJob {
.orElse(Boolean.TRUE);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
final String isLookupUrl = parser.get("isLookupUrl");
log.info("isLookupUrl: {}", isLookupUrl);
final ISLookUpService isLookupService = ISLookupClientFactory.getLookUpService(isLookupUrl);
new GroupEntitiesSparkJob(parser).run(isSparkSessionManaged, isLookupService);
}
public void run(Boolean isSparkSessionManaged, ISLookUpService isLookUpService)
throws ISLookUpException {
String graphInputPath = parser.get("graphInputPath");
log.info("graphInputPath: {}", graphInputPath);
@ -82,21 +60,19 @@ public class GroupEntitiesSparkJob {
String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath);
boolean filterInvisible = Boolean.parseBoolean(parser.get("filterInvisible"));
boolean filterInvisible = Boolean.valueOf(parser.get("filterInvisible"));
log.info("filterInvisible: {}", filterInvisible);
SparkConf conf = new SparkConf();
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.registerKryoClasses(ModelSupport.getOafModelClasses());
final VocabularyGroup vocs = VocabularyGroup.loadVocsFromIS(isLookUpService);
runWithSparkSession(
conf,
isSparkSessionManaged,
spark -> {
HdfsSupport.remove(checkpointPath, spark.sparkContext().hadoopConfiguration());
groupEntities(spark, graphInputPath, checkpointPath, outputPath, filterInvisible, vocs);
groupEntities(spark, graphInputPath, checkpointPath, outputPath, filterInvisible);
});
}
@ -105,7 +81,7 @@ public class GroupEntitiesSparkJob {
String inputPath,
String checkpointPath,
String outputPath,
boolean filterInvisible, VocabularyGroup vocs) {
boolean filterInvisible) {
Dataset<OafEntity> allEntities = spark.emptyDataset(OAFENTITY_KRYO_ENC);
@ -130,14 +106,10 @@ public class GroupEntitiesSparkJob {
}
Dataset<?> groupedEntities = allEntities
.map(
(MapFunction<OafEntity, OafEntity>) entity -> GraphCleaningFunctions
.applyCoarVocabularies(entity, vocs),
OAFENTITY_KRYO_ENC)
.groupByKey((MapFunction<OafEntity, String>) OafEntity::getId, Encoders.STRING())
.reduceGroups((ReduceFunction<OafEntity>) OafMapperUtils::mergeEntities)
.reduceGroups((ReduceFunction<OafEntity>) (b, a) -> OafMapperUtils.mergeEntities(b, a))
.map(
(MapFunction<Tuple2<String, OafEntity>, Tuple2<String, OafEntity>>) t -> new Tuple2<>(
(MapFunction<Tuple2<String, OafEntity>, Tuple2<String, OafEntity>>) t -> new Tuple2(
t._2().getClass().getName(), t._2()),
Encoders.tuple(Encoders.STRING(), OAFENTITY_KRYO_ENC));
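The hunk above switches the GroupEntitiesSparkJob pipeline between a version with and without the applyCoarVocabularies mapping step, while keeping the groupByKey/reduceGroups core. A minimal standalone sketch of that Spark pattern, using plain strings in place of OafEntity (class name and sample data are illustrative):
import java.util.Arrays;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class GroupByKeySketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate();

        // three records, two of them sharing the key "id1"
        Dataset<String> entities = spark
            .createDataset(Arrays.asList("id1|a", "id1|b", "id2|c"), Encoders.STRING());

        // group records sharing the same id and merge them, as the job does for OafEntity
        Dataset<String> merged = entities
            .groupByKey((MapFunction<String, String>) s -> s.split("\\|")[0], Encoders.STRING())
            .reduceGroups((ReduceFunction<String>) (b, a) -> b + "+" + a)
            .map((MapFunction<scala.Tuple2<String, String>, String>) t -> t._2(), Encoders.STRING());

        merged.show(false);
        spark.stop();
    }
}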

View File

@ -1,77 +0,0 @@
package eu.dnetlib.dhp.oozie;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkHiveSession;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import org.apache.commons.lang3.time.DurationFormatUtils;
import org.apache.commons.text.StringSubstitutor;
import org.apache.spark.SparkConf;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.common.io.Resources;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
public class RunSQLSparkJob {
private static final Logger log = LoggerFactory.getLogger(RunSQLSparkJob.class);
private final ArgumentApplicationParser parser;
public RunSQLSparkJob(ArgumentApplicationParser parser) {
this.parser = parser;
}
public static void main(String[] args) throws Exception {
Map<String, String> params = new HashMap<>();
for (int i = 0; i < args.length - 1; i++) {
if (args[i].startsWith("--")) {
params.put(args[i].substring(2), args[++i]);
}
}
/*
* String jsonConfiguration = IOUtils .toString( Objects .requireNonNull( RunSQLSparkJob.class
* .getResourceAsStream( "/eu/dnetlib/dhp/oozie/run_sql_parameters.json"))); final ArgumentApplicationParser
* parser = new ArgumentApplicationParser(jsonConfiguration); parser.parseArgument(args);
*/
Boolean isSparkSessionManaged = Optional
.ofNullable(params.get("isSparkSessionManaged"))
.map(Boolean::valueOf)
.orElse(Boolean.TRUE);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
URL url = com.google.common.io.Resources.getResource(params.get("sql"));
String raw_sql = Resources.toString(url, StandardCharsets.UTF_8);
String sql = StringSubstitutor.replace(raw_sql, params);
log.info("sql: {}", sql);
SparkConf conf = new SparkConf();
conf.set("hive.metastore.uris", params.get("hiveMetastoreUris"));
runWithSparkHiveSession(
conf,
isSparkSessionManaged,
spark -> {
for (String statement : sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*")) {
log.info("executing: {}", statement);
long startTime = System.currentTimeMillis();
spark.sql(statement).show();
log
.info(
"executed in {}",
DurationFormatUtils.formatDuration(System.currentTimeMillis() - startTime, "HH:mm:ss.S"));
}
});
}
}
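RunSQLSparkJob, shown above, resolves an SQL resource, substitutes the parameters into it, splits it on a ;/* EOS */ separator and submits each statement through spark.sql. A minimal sketch of the splitting step alone (class name and sample script are illustrative):
public class SqlSplitSketch {
    public static void main(String[] args) {
        String script = "CREATE TABLE t (id INT); /* EOS */ INSERT INTO t VALUES (1); /* EOS */";

        // same separator regex used above: a ';' followed by an EOS comment, with optional whitespace
        for (String statement : script.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*")) {
            if (!statement.trim().isEmpty()) {
                System.out.println("would execute: " + statement.trim());
                // in the job above this is spark.sql(statement).show() on a Hive-enabled session
            }
        }
    }
}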

View File

@ -1,12 +1,8 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static eu.dnetlib.dhp.schema.common.ModelConstants.*;
import static eu.dnetlib.dhp.schema.common.ModelConstants.OPENAIRE_META_RESOURCE_TYPE;
import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.getProvenance;
import java.net.MalformedURLException;
import java.net.URL;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
@ -23,7 +19,6 @@ import com.google.common.collect.Lists;
import com.google.common.collect.Sets;
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup;
import eu.dnetlib.dhp.common.vocabulary.VocabularyTerm;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*;
@ -31,10 +26,6 @@ import me.xuender.unidecode.Unidecode;
public class GraphCleaningFunctions extends CleaningFunctions {
public static final String DNET_PUBLISHERS = "dnet:publishers";
public static final String DNET_LICENSES = "dnet:licenses";
public static final String ORCID_CLEANING_REGEX = ".*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9x]{4})";
public static final int ORCID_LEN = 19;
public static final String CLEANING_REGEX = "(?:\\n|\\r|\\t)";
@ -45,13 +36,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
public static final int TITLE_FILTER_RESIDUAL_LENGTH = 5;
private static final String NAME_CLEANING_REGEX = "[\\r\\n\\t\\s]+";
private static final Set<String> INVALID_AUTHOR_NAMES = new HashSet<>();
private static final Set<String> INVALID_URLS = new HashSet<>();
private static final Set<String> INVALID_URL_HOSTS = new HashSet<>();
private static final HashSet<String> PEER_REVIEWED_TYPES = new HashSet<>();
static {
@ -64,47 +48,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
PEER_REVIEWED_TYPES.add("Thesis");
PEER_REVIEWED_TYPES.add("Bachelor thesis");
PEER_REVIEWED_TYPES.add("Conference object");
INVALID_AUTHOR_NAMES.add("(:null)");
INVALID_AUTHOR_NAMES.add("(:unap)");
INVALID_AUTHOR_NAMES.add("(:tba)");
INVALID_AUTHOR_NAMES.add("(:unas)");
INVALID_AUTHOR_NAMES.add("(:unav)");
INVALID_AUTHOR_NAMES.add("(:unkn)");
INVALID_AUTHOR_NAMES.add("(:unkn) unknown");
INVALID_AUTHOR_NAMES.add(":none");
INVALID_AUTHOR_NAMES.add(":null");
INVALID_AUTHOR_NAMES.add(":unas");
INVALID_AUTHOR_NAMES.add(":unav");
INVALID_AUTHOR_NAMES.add(":unkn");
INVALID_AUTHOR_NAMES.add("[autor desconocido]");
INVALID_AUTHOR_NAMES.add("[s. n.]");
INVALID_AUTHOR_NAMES.add("[s.n]");
INVALID_AUTHOR_NAMES.add("[unknown]");
INVALID_AUTHOR_NAMES.add("anonymous");
INVALID_AUTHOR_NAMES.add("n.n.");
INVALID_AUTHOR_NAMES.add("nn");
INVALID_AUTHOR_NAMES.add("no name supplied");
INVALID_AUTHOR_NAMES.add("none");
INVALID_AUTHOR_NAMES.add("none available");
INVALID_AUTHOR_NAMES.add("not available not available");
INVALID_AUTHOR_NAMES.add("null &na;");
INVALID_AUTHOR_NAMES.add("null anonymous");
INVALID_AUTHOR_NAMES.add("unbekannt");
INVALID_AUTHOR_NAMES.add("unknown");
INVALID_URL_HOSTS.add("creativecommons.org");
INVALID_URL_HOSTS.add("www.academia.edu");
INVALID_URL_HOSTS.add("academia.edu");
INVALID_URL_HOSTS.add("researchgate.net");
INVALID_URL_HOSTS.add("www.researchgate.net");
INVALID_URLS.add("http://repo.scoap3.org/api");
INVALID_URLS.add("http://ora.ox.ac.uk/objects/uuid:");
INVALID_URLS.add("http://ntur.lib.ntu.edu.tw/news/agent_contract.pdf");
INVALID_URLS.add("https://media.springer.com/full/springer-instructions-for-authors-assets/pdf/SN_BPF_EN.pdf");
INVALID_URLS.add("http://www.tobaccoinduceddiseases.org/dl/61aad426c96519bea4040a374c6a6110/");
INVALID_URLS.add("https://www.bilboard.nl/verenigingsbladen/bestuurskundige-berichten");
}
public static <T extends Oaf> T cleanContext(T value, String contextId, String verifyParam) {
@ -312,8 +255,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
}
if (value instanceof Datasource) {
final Datasource d = (Datasource) value;
return Objects.nonNull(d.getOfficialname()) && StringUtils.isNotBlank(d.getOfficialname().getValue());
// nothing to evaluate here
} else if (value instanceof Project) {
final Project p = (Project) value;
return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
@ -415,14 +357,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.getPublisher()
.getValue()
.replaceAll(NAME_CLEANING_REGEX, " "));
if (vocs.vocabularyExists(DNET_PUBLISHERS)) {
vocs
.find(DNET_PUBLISHERS)
.map(voc -> voc.getTermBySynonym(r.getPublisher().getValue()))
.map(VocabularyTerm::getName)
.ifPresent(publisher -> r.getPublisher().setValue(publisher));
}
}
}
if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) {
@ -583,14 +517,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
i.setRefereed(qualifier("0000", "Unknown", ModelConstants.DNET_REVIEW_LEVELS));
}
if (Objects.nonNull(i.getLicense()) && Objects.nonNull(i.getLicense().getValue())) {
vocs
.find(DNET_LICENSES)
.map(voc -> voc.getTermBySynonym(i.getLicense().getValue()))
.map(VocabularyTerm::getId)
.ifPresent(license -> i.getLicense().setValue(license));
}
// from the script from Dimitris
if ("0000".equals(i.getRefereed().getClassid())) {
final boolean isFromCrossref = Optional
@ -632,15 +558,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
i.setFulltext(null);
}
if (Objects.nonNull(i.getUrl())) {
i
.setUrl(
i
.getUrl()
.stream()
.filter(GraphCleaningFunctions::urlFilter)
.collect(Collectors.toList()));
}
}
}
if (Objects.isNull(r.getBestaccessright())
@ -663,7 +580,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.getAuthor()
.stream()
.filter(Objects::nonNull)
.filter(GraphCleaningFunctions::isValidAuthorName)
.filter(a -> StringUtils.isNotBlank(a.getFullname()))
.filter(a -> StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")))
.map(GraphCleaningFunctions::cleanupAuthor)
.collect(Collectors.toList()));
@ -690,9 +608,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(Objects::nonNull)
.filter(p -> Objects.nonNull(p.getQualifier()))
.filter(p -> StringUtils.isNotBlank(p.getValue()))
.filter(
p -> StringUtils
.contains(StringUtils.lowerCase(p.getQualifier().getClassid()), ORCID))
.map(p -> {
// hack to distinguish orcid from orcid_pending
String pidProvenance = getProvenance(p.getDataInfo());
@ -702,8 +617,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.toLowerCase()
.contains(ModelConstants.ORCID)) {
if (pidProvenance
.equals(ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY) ||
pidProvenance.equals("ORCID_ENRICHMENT")) {
.equals(ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY)) {
p.getQualifier().setClassid(ModelConstants.ORCID);
} else {
p.getQualifier().setClassid(ModelConstants.ORCID_PENDING);
@ -825,30 +739,12 @@ public class GraphCleaningFunctions extends CleaningFunctions {
// HELPERS
private static boolean isValidAuthorName(Author a) {
return StringUtils.isNotBlank(a.getFullname()) &&
StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")) &&
!INVALID_AUTHOR_NAMES.contains(StringUtils.lowerCase(a.getFullname()).trim()) &&
!Stream
.of(a.getFullname(), a.getName(), a.getSurname())
.filter(StringUtils::isNotBlank)
.collect(Collectors.joining(""))
.toLowerCase()
.matches(INVALID_AUTHOR_REGEX);
}
private static boolean urlFilter(String u) {
try {
final URL url = new URL(u);
if (StringUtils.isBlank(url.getPath()) || "/".equals(url.getPath())) {
return false;
}
if (INVALID_URL_HOSTS.contains(url.getHost())) {
return false;
}
return !INVALID_URLS.contains(url.toString());
} catch (MalformedURLException ex) {
return false;
}
return !Stream
.of(a.getFullname(), a.getName(), a.getSurname())
.filter(s -> s != null && !s.isEmpty())
.collect(Collectors.joining(""))
.toLowerCase()
.matches(INVALID_AUTHOR_REGEX);
}
private static List<StructuredProperty> processPidCleaning(List<StructuredProperty> pids) {
@ -898,105 +794,4 @@ public class GraphCleaningFunctions extends CleaningFunctions {
return s;
}
public static OafEntity applyCoarVocabularies(OafEntity entity, VocabularyGroup vocs) {
if (entity instanceof Result) {
final Result result = (Result) entity;
Optional
.ofNullable(result.getInstance())
.ifPresent(
instances -> instances
.forEach(
instance -> {
if (Objects.isNull(instance.getInstanceTypeMapping())) {
List<InstanceTypeMapping> mapping = Lists.newArrayList();
mapping
.add(
OafMapperUtils
.instanceTypeMapping(
instance.getInstancetype().getClassname(),
OPENAIRE_COAR_RESOURCE_TYPES_3_1));
instance.setInstanceTypeMapping(mapping);
}
Optional<InstanceTypeMapping> optionalItm = instance
.getInstanceTypeMapping()
.stream()
.filter(GraphCleaningFunctions::originalResourceType)
.findFirst();
if (optionalItm.isPresent()) {
InstanceTypeMapping coarItm = optionalItm.get();
Optional
.ofNullable(
vocs
.lookupTermBySynonym(
OPENAIRE_COAR_RESOURCE_TYPES_3_1, coarItm.getOriginalType()))
.ifPresent(type -> {
coarItm.setTypeCode(type.getClassid());
coarItm.setTypeLabel(type.getClassname());
});
final List<InstanceTypeMapping> mappings = Lists.newArrayList();
if (vocs.vocabularyExists(OPENAIRE_USER_RESOURCE_TYPES)) {
Optional
.ofNullable(
vocs
.lookupTermBySynonym(
OPENAIRE_USER_RESOURCE_TYPES, coarItm.getTypeCode()))
.ifPresent(
type -> mappings
.add(
OafMapperUtils
.instanceTypeMapping(coarItm.getTypeCode(), type)));
}
if (!mappings.isEmpty()) {
instance.getInstanceTypeMapping().addAll(mappings);
}
}
}));
result.setMetaResourceType(getMetaResourceType(result.getInstance(), vocs));
}
return entity;
}
private static boolean originalResourceType(InstanceTypeMapping itm) {
return StringUtils.isNotBlank(itm.getOriginalType()) &&
OPENAIRE_COAR_RESOURCE_TYPES_3_1.equals(itm.getVocabularyName()) &&
StringUtils.isBlank(itm.getTypeCode()) &&
StringUtils.isBlank(itm.getTypeLabel());
}
private static Qualifier getMetaResourceType(final List<Instance> instances, final VocabularyGroup vocs) {
return Optional
.ofNullable(instances)
.map(ii -> {
if (vocs.vocabularyExists(OPENAIRE_META_RESOURCE_TYPE)) {
Optional<InstanceTypeMapping> itm = ii
.stream()
.filter(Objects::nonNull)
.flatMap(
i -> Optional
.ofNullable(i.getInstanceTypeMapping())
.map(Collection::stream)
.orElse(Stream.empty()))
.filter(t -> OPENAIRE_COAR_RESOURCE_TYPES_3_1.equals(t.getVocabularyName()))
.findFirst();
if (!itm.isPresent() || Objects.isNull(itm.get().getTypeCode())) {
return null;
} else {
final String typeCode = itm.get().getTypeCode();
return Optional
.ofNullable(vocs.lookupTermBySynonym(OPENAIRE_META_RESOURCE_TYPE, typeCode))
.orElseThrow(
() -> new IllegalStateException("unable to find a synonym for '" + typeCode + "' in " +
OPENAIRE_META_RESOURCE_TYPE));
}
} else {
throw new IllegalStateException("vocabulary '" + OPENAIRE_META_RESOURCE_TYPE + "' not available");
}
})
.orElse(null);
}
}
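The urlFilter shown above drops instance URLs with an empty path, a blacklisted host, or an exact blacklist match. A standalone sketch of that check (class name is illustrative; the sample blacklist entries are copied from the block above):
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class UrlFilterSketch {

    private static final Set<String> INVALID_URL_HOSTS = new HashSet<>();
    private static final Set<String> INVALID_URLS = new HashSet<>();

    static {
        INVALID_URL_HOSTS.add("creativecommons.org");
        INVALID_URL_HOSTS.add("academia.edu");
        INVALID_URLS.add("http://repo.scoap3.org/api");
    }

    // mirrors the urlFilter above: reject empty or root-only paths,
    // blacklisted hosts, blacklisted exact URLs, and anything that does not parse
    static boolean urlFilter(String u) {
        try {
            final URL url = new URL(u);
            if (url.getPath() == null || url.getPath().isEmpty() || "/".equals(url.getPath())) {
                return false;
            }
            if (INVALID_URL_HOSTS.contains(url.getHost())) {
                return false;
            }
            return !INVALID_URLS.contains(url.toString());
        } catch (MalformedURLException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(urlFilter("https://example.org/article/123")); // true
        System.out.println(urlFilter("https://academia.edu/paper"));      // false
    }
}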

View File

@ -14,6 +14,7 @@ import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils;
import eu.dnetlib.dhp.schema.common.AccessRightComparator;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*;
@ -140,28 +141,6 @@ public class OafMapperUtils {
.collect(Collectors.toList());
}
public static InstanceTypeMapping instanceTypeMapping(String originalType, String code, String label,
String vocabularyName) {
final InstanceTypeMapping m = new InstanceTypeMapping();
m.setVocabularyName(vocabularyName);
m.setOriginalType(originalType);
m.setTypeCode(code);
m.setTypeLabel(label);
return m;
}
public static InstanceTypeMapping instanceTypeMapping(String originalType, Qualifier term) {
return instanceTypeMapping(originalType, term.getClassid(), term.getClassname(), term.getSchemeid());
}
public static InstanceTypeMapping instanceTypeMapping(String originalType) {
return instanceTypeMapping(originalType, null, null, null);
}
public static InstanceTypeMapping instanceTypeMapping(String originalType, String vocabularyName) {
return instanceTypeMapping(originalType, null, null, vocabularyName);
}
public static Qualifier unknown(final String schemeid, final String schemename) {
return qualifier(UNKNOWN, "Unknown", schemeid, schemename);
}

View File

@ -28,11 +28,5 @@
"paramLongName": "filterInvisible",
"paramDescription": "if true filters out invisible entities",
"paramRequired": true
},
{
"paramName": "isu",
"paramLongName": "isLookupUrl",
"paramDescription": "url to the ISLookup Service",
"paramRequired": true
}
]

View File

@ -1,20 +0,0 @@
[
{
"paramName": "issm",
"paramLongName": "isSparkSessionManaged",
"paramDescription": "when true will stop SparkSession after job execution",
"paramRequired": false
},
{
"paramName": "hmu",
"paramLongName": "hiveMetastoreUris",
"paramDescription": "the hive metastore uris",
"paramRequired": true
},
{
"paramName": "sql",
"paramLongName": "sql",
"paramDescription": "sql script to execute",
"paramRequired": true
}
]
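These are the CLI parameters of RunSQLSparkJob, which (as shown earlier in this diff) collects its arguments with a plain --key value scan instead of the ArgumentApplicationParser. A standalone sketch of that loop (class name is illustrative):
import java.util.HashMap;
import java.util.Map;

public class ArgScanSketch {
    public static void main(String[] args) {
        // same loop used in RunSQLSparkJob.main: every "--name value" pair becomes a map entry
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].startsWith("--")) {
                params.put(args[i].substring(2), args[++i]);
            }
        }
        params.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}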

View File

@ -1,114 +0,0 @@
package eu.dnetlib.oa.merge;
import static org.junit.jupiter.api.Assertions.*;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;
import java.util.Objects;
import org.junit.jupiter.api.Test;
import org.junit.platform.commons.util.StringUtils;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.oa.merge.AuthorMerger;
import eu.dnetlib.dhp.schema.oaf.Author;
public class AuthorMergerTest {
@Test
public void testEnrcichAuthor() throws Exception {
final ObjectMapper mapper = new ObjectMapper();
BufferedReader pr = new BufferedReader(new InputStreamReader(
Objects
.requireNonNull(
AuthorMergerTest.class
.getResourceAsStream("/eu/dnetlib/dhp/oa/merge/authors_publication_sample.json"))));
BufferedReader or = new BufferedReader(new InputStreamReader(
Objects
.requireNonNull(
AuthorMergerTest.class.getResourceAsStream("/eu/dnetlib/dhp/oa/merge/authors_orcid_sample.json"))));
TypeReference<List<Author>> aclass = new TypeReference<List<Author>>() {
};
String pubLine;
int i = 0;
while ((pubLine = pr.readLine()) != null) {
final String pubId = pubLine;
final String MatchPidOrcid = or.readLine();
final String pubOrcid = or.readLine();
final String data = pr.readLine();
if (StringUtils.isNotBlank(data)) {
List<Author> publicationAuthors = mapper.readValue(data, aclass);
List<Author> orcidAuthors = mapper.readValue(or.readLine(), aclass);
System.out.printf("OAF ID = %s \n", pubId);
System.out.printf("ORCID Intersected ID = %s \n", pubOrcid);
System.out.printf("OAF Author Size = %d \n", publicationAuthors.size());
System.out.printf("Oricd Author Size = %d \n", orcidAuthors.size());
System.out.printf("Oricd Matched PID = %s \n", MatchPidOrcid);
long originalAuthorWithPiD = publicationAuthors
.stream()
.filter(
a -> a.getPid() != null && a
.getPid()
.stream()
.anyMatch(
p -> p.getQualifier() != null
&& p.getQualifier().getClassid().toLowerCase().contains("orcid")))
.count();
long start = System.currentTimeMillis();
// final List<Author> enrichedList = AuthorMerger.enrichOrcid(publicationAuthors, orcidAuthors);
final List<Author> enrichedList = AuthorMerger.enrichOrcid(publicationAuthors, orcidAuthors);
long enrichedAuthorWithPid = enrichedList
.stream()
.filter(
a -> a.getPid() != null && a
.getPid()
.stream()
.anyMatch(
p -> p.getQualifier() != null
&& p.getQualifier().getClassid().toLowerCase().contains("orcid")))
.count();
long totalTime = (System.currentTimeMillis() - start) / 1000;
System.out
.printf(
"Enriched authors in %d seconds from %d pid to %d pid \n", totalTime, originalAuthorWithPiD,
enrichedAuthorWithPid);
System.out.println("=================");
}
}
}
@Test
public void checkSimilarityTest() {
final Author left = new Author();
left.setName("Anand");
left.setSurname("Rachna");
left.setFullname("Anand, Rachna");
System.out.println(AuthorMerger.normalizeFullName(left.getFullname()));
final Author right = new Author();
right.setName("Rachna");
right.setSurname("Anand");
right.setFullname("Rachna, Anand");
// System.out.println(AuthorMerger.normalize(right.getFullname()));
boolean same = AuthorMerger.checkORCIDSimilarity(left, right);
assertTrue(same);
}
}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -14,9 +14,9 @@ import eu.dnetlib.pace.config.Config;
public abstract class AbstractClusteringFunction extends AbstractPaceFunctions implements ClusteringFunction {
protected Map<String, Object> params;
protected Map<String, Integer> params;
public AbstractClusteringFunction(final Map<String, Object> params) {
public AbstractClusteringFunction(final Map<String, Integer> params) {
this.params = params;
}
@ -27,7 +27,7 @@ public abstract class AbstractClusteringFunction extends AbstractPaceFunctions i
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(s -> normalize(s))
.map(this::normalize)
.map(s -> filterAllStopWords(s))
.map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist))
@ -36,24 +36,11 @@ public abstract class AbstractClusteringFunction extends AbstractPaceFunctions i
.collect(Collectors.toCollection(HashSet::new));
}
public Map<String, Object> getParams() {
public Map<String, Integer> getParams() {
return params;
}
protected Integer param(String name) {
Object val = params.get(name);
if (val == null)
return null;
if (val instanceof Number) {
return ((Number) val).intValue();
}
return Integer.parseInt(val.toString());
}
protected int paramOrDefault(String name, int i) {
Integer res = param(name);
if (res == null)
res = i;
return res;
return params.get(name);
}
}
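The AbstractClusteringFunction hunk above switches the clustering parameters between Map<String, Object> and Map<String, Integer> and touches the param/paramOrDefault coercion helpers. A standalone sketch of what those helpers do (class name and sample values are illustrative):
import java.util.HashMap;
import java.util.Map;

public class ClusteringParamsSketch {

    private final Map<String, Object> params;

    public ClusteringParamsSketch(Map<String, Object> params) {
        this.params = params;
    }

    // coerce a parameter to Integer whether it was provided as a number or as a string
    protected Integer param(String name) {
        Object val = params.get(name);
        if (val == null)
            return null;
        if (val instanceof Number)
            return ((Number) val).intValue();
        return Integer.parseInt(val.toString());
    }

    // same lookup with a fallback when the parameter is missing
    protected int paramOrDefault(String name, int dflt) {
        Integer res = param(name);
        return res == null ? dflt : res;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();
        conf.put("max", "2");      // string form
        conf.put("windowSize", 4); // numeric form
        ClusteringParamsSketch sketch = new ClusteringParamsSketch(conf);
        System.out.println(sketch.paramOrDefault("max", 1));      // 2
        System.out.println(sketch.paramOrDefault("ngramLen", 3)); // 3 (fallback)
    }
}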

View File

@ -13,7 +13,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("acronyms")
public class Acronyms extends AbstractClusteringFunction {
public Acronyms(Map<String, Object> params) {
public Acronyms(Map<String, Integer> params) {
super(params);
}

View File

@ -11,6 +11,6 @@ public interface ClusteringFunction {
public Collection<String> apply(Config config, List<String> fields);
public Map<String, Object> getParams();
public Map<String, Integer> getParams();
}

View File

@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("immutablefieldvalue")
public class ImmutableFieldValue extends AbstractClusteringFunction {
public ImmutableFieldValue(final Map<String, Object> params) {
public ImmutableFieldValue(final Map<String, Integer> params) {
super(params);
}

View File

@ -1,69 +0,0 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils;
import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.Option;
import eu.dnetlib.pace.common.AbstractPaceFunctions;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.util.MapDocumentUtil;
@ClusteringClass("jsonlistclustering")
public class JSONListClustering extends AbstractPaceFunctions implements ClusteringFunction {
private Map<String, Object> params;
public JSONListClustering(Map<String, Object> params) {
this.params = params;
}
@Override
public Map<String, Object> getParams() {
return params;
}
@Override
public Collection<String> apply(Config conf, List<String> fields) {
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(s -> doApply(conf, s))
.filter(StringUtils::isNotBlank)
.collect(Collectors.toCollection(HashSet::new));
}
private String doApply(Config conf, String json) {
StringBuilder st = new StringBuilder(); // to build the string used for comparisons, based on the jpaths listed in the
// parameters
final DocumentContext documentContext = JsonPath
.using(Configuration.defaultConfiguration().addOptions(Option.SUPPRESS_EXCEPTIONS))
.parse(json);
// for each path in the param list
for (String key : params.keySet().stream().filter(k -> k.contains("jpath")).collect(Collectors.toList())) {
String path = params.get(key).toString();
String value = MapDocumentUtil.getJPathString(path, documentContext);
if (value == null || value.isEmpty())
value = "";
st.append(value);
st.append(" ");
}
st.setLength(st.length() - 1);
if (StringUtils.isBlank(st)) {
return "1";
}
return st.toString();
}
}

View File

@ -11,7 +11,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("keywordsclustering")
public class KeywordsClustering extends AbstractClusteringFunction {
public KeywordsClustering(Map<String, Object> params) {
public KeywordsClustering(Map<String, Integer> params) {
super(params);
}
@ -19,8 +19,8 @@ public class KeywordsClustering extends AbstractClusteringFunction {
protected Collection<String> doApply(final Config conf, String s) {
// takes city codes and keywords codes without duplicates
Set<String> keywords = getKeywords(s, conf.translationMap(), paramOrDefault("windowSize", 4));
Set<String> cities = getCities(s, paramOrDefault("windowSize", 4));
Set<String> keywords = getKeywords(s, conf.translationMap(), params.getOrDefault("windowSize", 4));
Set<String> cities = getCities(s, params.getOrDefault("windowSize", 4));
// list of combination to return as result
final Collection<String> combinations = new LinkedHashSet<String>();
@ -28,7 +28,7 @@ public class KeywordsClustering extends AbstractClusteringFunction {
for (String keyword : keywordsToCodes(keywords, conf.translationMap())) {
for (String city : citiesToCodes(cities)) {
combinations.add(keyword + "-" + city);
if (combinations.size() >= paramOrDefault("max", 2)) {
if (combinations.size() >= params.getOrDefault("max", 2)) {
return combinations;
}
}
@ -42,8 +42,8 @@ public class KeywordsClustering extends AbstractClusteringFunction {
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(KeywordsClustering::cleanup)
.map(KeywordsClustering::normalize)
.map(this::cleanup)
.map(this::normalize)
.map(s -> filterAllStopWords(s))
.map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist))

View File

@ -16,7 +16,7 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {
private boolean DEFAULT_AGGRESSIVE = true;
public LastNameFirstInitial(final Map<String, Object> params) {
public LastNameFirstInitial(final Map<String, Integer> params) {
super(params);
}
@ -25,7 +25,7 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(LastNameFirstInitial::normalize)
.map(this::normalize)
.map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist))
.flatMap(c -> c.stream())
@ -33,7 +33,8 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {
.collect(Collectors.toCollection(HashSet::new));
}
public static String normalize(final String s) {
@Override
protected String normalize(final String s) {
return fixAliases(transliterate(nfd(unicodeNormalization(s))))
// do not compact the regexes in a single expression, would cause StackOverflowError in case of large input
// strings

View File

@ -15,7 +15,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("lowercase")
public class LowercaseClustering extends AbstractClusteringFunction {
public LowercaseClustering(final Map<String, Object> params) {
public LowercaseClustering(final Map<String, Integer> params) {
super(params);
}

View File

@ -12,11 +12,11 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("ngrampairs")
public class NgramPairs extends Ngrams {
public NgramPairs(Map<String, Object> params) {
public NgramPairs(Map<String, Integer> params) {
super(params, false);
}
public NgramPairs(Map<String, Object> params, boolean sorted) {
public NgramPairs(Map<String, Integer> params, boolean sorted) {
super(params, sorted);
}

View File

@ -10,11 +10,11 @@ public class Ngrams extends AbstractClusteringFunction {
private final boolean sorted;
public Ngrams(Map<String, Object> params) {
public Ngrams(Map<String, Integer> params) {
this(params, false);
}
public Ngrams(Map<String, Object> params, boolean sorted) {
public Ngrams(Map<String, Integer> params, boolean sorted) {
super(params);
this.sorted = sorted;
}

View File

@ -1,113 +0,0 @@
package eu.dnetlib.pace.clustering;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;
import com.google.common.base.Splitter;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("numAuthorsTitleSuffixPrefixChain")
public class NumAuthorsTitleSuffixPrefixChain extends AbstractClusteringFunction {
public NumAuthorsTitleSuffixPrefixChain(Map<String, Object> params) {
super(params);
}
@Override
public Collection<String> apply(Config conf, List<String> fields) {
try {
int num_authors = Math.min(Integer.parseInt(fields.get(0)), 21); // SIZE threshold is 20, +1
if (num_authors > 0) {
return super.apply(conf, fields.subList(1, fields.size()))
.stream()
.map(s -> num_authors + "-" + s)
.collect(Collectors.toList());
}
} catch (NumberFormatException e) {
// missing or null authors array
}
return Collections.emptyList();
}
@Override
protected Collection<String> doApply(Config conf, String s) {
return suffixPrefixChain(cleanup(s), param("mod"));
}
private Collection<String> suffixPrefixChain(String s, int mod) {
// create the list of words from the string (remove short words)
List<String> wordsList = Arrays
.stream(s.split(" "))
.filter(si -> si.length() > 3)
.collect(Collectors.toList());
final int words = wordsList.size();
final int letters = s.length();
// create the prefix: number of words + number of letters/mod
String prefix = words / mod + "-";
return doSuffixPrefixChain(wordsList, prefix);
}
private Collection<String> doSuffixPrefixChain(List<String> wordsList, String prefix) {
Set<String> set = Sets.newLinkedHashSet();
switch (wordsList.size()) {
case 0:
break;
case 1:
set.add(wordsList.get(0));
break;
case 2:
set
.add(
prefix +
suffix(wordsList.get(0), 3) +
prefix(wordsList.get(1), 3));
set
.add(
prefix +
prefix(wordsList.get(0), 3) +
suffix(wordsList.get(1), 3));
break;
default:
set
.add(
prefix +
suffix(wordsList.get(0), 3) +
prefix(wordsList.get(1), 3) +
suffix(wordsList.get(2), 3));
set
.add(
prefix +
prefix(wordsList.get(0), 3) +
suffix(wordsList.get(1), 3) +
prefix(wordsList.get(2), 3));
break;
}
return set;
}
private String suffix(String s, int len) {
return s.substring(s.length() - len);
}
private String prefix(String s, int len) {
return s.substring(0, len);
}
}
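NumAuthorsTitleSuffixPrefixChain, shown above, builds clustering keys from the author count plus chained 3-character suffixes/prefixes of the longer title words. A standalone sketch of the two-word case (class name and sample title are illustrative):
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SuffixPrefixSketch {

    static String suffix(String s, int len) {
        return s.substring(s.length() - len);
    }

    static String prefix(String s, int len) {
        return s.substring(0, len);
    }

    public static void main(String[] args) {
        String title = "knowledge graphs";
        int numAuthors = 3;
        int mod = 10; // the "mod" parameter of the clustering function above

        // keep only words longer than 3 characters, as the code above does
        List<String> words = Arrays
            .stream(title.split(" "))
            .filter(w -> w.length() > 3)
            .collect(Collectors.toList());

        String keyPrefix = words.size() / mod + "-";

        // two-word case: suffix of the first word + prefix of the second, and vice versa,
        // each prefixed by the author count
        String key1 = numAuthors + "-" + keyPrefix + suffix(words.get(0), 3) + prefix(words.get(1), 3);
        String key2 = numAuthors + "-" + keyPrefix + prefix(words.get(0), 3) + suffix(words.get(1), 3);

        System.out.println(key1); // 3-0-dgegra
        System.out.println(key2); // 3-0-knophs
    }
}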

View File

@ -17,11 +17,11 @@ import eu.dnetlib.pace.model.Person;
@ClusteringClass("personClustering")
public class PersonClustering extends AbstractPaceFunctions implements ClusteringFunction {
private Map<String, Object> params;
private Map<String, Integer> params;
private static final int MAX_TOKENS = 5;
public PersonClustering(final Map<String, Object> params) {
public PersonClustering(final Map<String, Integer> params) {
this.params = params;
}
@ -77,7 +77,7 @@ public class PersonClustering extends AbstractPaceFunctions implements Clusterin
// }
@Override
public Map<String, Object> getParams() {
public Map<String, Integer> getParams() {
return params;
}

View File

@ -15,7 +15,7 @@ public class PersonHash extends AbstractClusteringFunction {
private boolean DEFAULT_AGGRESSIVE = false;
public PersonHash(final Map<String, Object> params) {
public PersonHash(final Map<String, Integer> params) {
super(params);
}

View File

@ -8,7 +8,7 @@ import eu.dnetlib.pace.config.Config;
public class RandomClusteringFunction extends AbstractClusteringFunction {
public RandomClusteringFunction(Map<String, Object> params) {
public RandomClusteringFunction(Map<String, Integer> params) {
super(params);
}

View File

@ -1,10 +1,7 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.*;
import com.google.common.base.Joiner;
import com.google.common.base.Splitter;
@ -15,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("sortedngrampairs")
public class SortedNgramPairs extends NgramPairs {
public SortedNgramPairs(Map<String, Object> params) {
public SortedNgramPairs(Map<String, Integer> params) {
super(params, false);
}

View File

@ -15,7 +15,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("spacetrimmingfieldvalue")
public class SpaceTrimmingFieldValue extends AbstractClusteringFunction {
public SpaceTrimmingFieldValue(final Map<String, Object> params) {
public SpaceTrimmingFieldValue(final Map<String, Integer> params) {
super(params);
}
@ -25,7 +25,7 @@ public class SpaceTrimmingFieldValue extends AbstractClusteringFunction {
res
.add(
StringUtils.isBlank(s) ? RandomStringUtils.random(param("randomLength"))
StringUtils.isBlank(s) ? RandomStringUtils.random(getParams().get("randomLength"))
: s.toLowerCase().replaceAll("\\s+", ""));
return res;

View File

@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("suffixprefix")
public class SuffixPrefix extends AbstractClusteringFunction {
public SuffixPrefix(Map<String, Object> params) {
public SuffixPrefix(Map<String, Integer> params) {
super(params);
}

View File

@ -15,17 +15,12 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("urlclustering")
public class UrlClustering extends AbstractPaceFunctions implements ClusteringFunction {
protected Map<String, Object> params;
protected Map<String, Integer> params;
public UrlClustering(final Map<String, Object> params) {
public UrlClustering(final Map<String, Integer> params) {
this.params = params;
}
@Override
public Map<String, Object> getParams() {
return params;
}
@Override
public Collection<String> apply(final Config conf, List<String> fields) {
try {
@ -40,6 +35,11 @@ public class UrlClustering extends AbstractPaceFunctions implements ClusteringFu
}
}
@Override
public Map<String, Integer> getParams() {
return null;
}
private URL asUrl(String value) {
try {
return new URL(value);

View File

@ -11,7 +11,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("wordsStatsSuffixPrefixChain")
public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {
public WordsStatsSuffixPrefixChain(Map<String, Object> params) {
public WordsStatsSuffixPrefixChain(Map<String, Integer> params) {
super(params);
}

View File

@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("wordssuffixprefix")
public class WordsSuffixPrefix extends AbstractClusteringFunction {
public WordsSuffixPrefix(Map<String, Object> params) {
public WordsSuffixPrefix(Map<String, Integer> params) {
super(params);
}

View File

@ -16,6 +16,7 @@ import org.apache.commons.lang3.StringUtils;
import com.google.common.base.Joiner;
import com.google.common.base.Splitter;
import com.google.common.collect.Iterables;
import com.google.common.collect.Lists;
import com.google.common.collect.Sets;
import com.ibm.icu.text.Transliterator;
@ -26,7 +27,7 @@ import eu.dnetlib.pace.clustering.NGramUtils;
*
* @author claudio
*/
public class AbstractPaceFunctions {
public abstract class AbstractPaceFunctions {
// city map to be used when translating the city names into codes
private static Map<String, String> cityMap = AbstractPaceFunctions
@ -61,14 +62,11 @@ public class AbstractPaceFunctions {
private static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
private static Pattern romanNumberPattern = Pattern
.compile("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$");
protected static String concat(final List<String> l) {
protected String concat(final List<String> l) {
return Joiner.on(" ").skipNulls().join(l);
}
public static String cleanup(final String s) {
protected String cleanup(final String s) {
final String s1 = HTML_REGEX.matcher(s).replaceAll("");
final String s2 = unicodeNormalization(s1.toLowerCase());
final String s3 = nfd(s2);
@ -84,7 +82,7 @@ public class AbstractPaceFunctions {
return s12;
}
protected static String fixXML(final String a) {
protected String fixXML(final String a) {
return a
.replaceAll("&ndash;", " ")
@ -93,7 +91,7 @@ public class AbstractPaceFunctions {
.replaceAll("&minus;", " ");
}
protected static boolean checkNumbers(final String a, final String b) {
protected boolean checkNumbers(final String a, final String b) {
final String numbersA = getNumbers(a);
final String numbersB = getNumbers(b);
final String romansA = getRomans(a);
@ -101,7 +99,7 @@ public class AbstractPaceFunctions {
return !numbersA.equals(numbersB) || !romansA.equals(romansB);
}
protected static String getRomans(final String s) {
protected String getRomans(final String s) {
final StringBuilder sb = new StringBuilder();
for (final String t : s.split(" ")) {
sb.append(isRoman(t) ? t : "");
@ -109,12 +107,13 @@ public class AbstractPaceFunctions {
return sb.toString();
}
protected static boolean isRoman(final String s) {
Matcher m = romanNumberPattern.matcher(s);
return m.matches() && m.hitEnd();
protected boolean isRoman(final String s) {
return s
.replaceAll("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", "qwertyuiop")
.equals("qwertyuiop");
}
protected static String getNumbers(final String s) {
protected String getNumbers(final String s) {
final StringBuilder sb = new StringBuilder();
for (final String t : s.split(" ")) {
sb.append(isNumber(t) ? t : "");
@ -122,7 +121,7 @@ public class AbstractPaceFunctions {
return sb.toString();
}
public static boolean isNumber(String strNum) {
public boolean isNumber(String strNum) {
if (strNum == null) {
return false;
}
@ -148,7 +147,7 @@ public class AbstractPaceFunctions {
}
}
protected static String removeSymbols(final String s) {
protected String removeSymbols(final String s) {
final StringBuilder sb = new StringBuilder();
s.chars().forEach(ch -> {
@ -158,11 +157,11 @@ public class AbstractPaceFunctions {
return sb.toString().replaceAll("\\s+", " ");
}
protected static boolean notNull(final String s) {
protected boolean notNull(final String s) {
return s != null;
}
public static String normalize(final String s) {
protected String normalize(final String s) {
return fixAliases(transliterate(nfd(unicodeNormalization(s))))
.toLowerCase()
// do not compact the regexes in a single expression, would cause StackOverflowError in case of large input
@ -175,16 +174,16 @@ public class AbstractPaceFunctions {
.trim();
}
public static String nfd(final String s) {
public String nfd(final String s) {
return Normalizer.normalize(s, Normalizer.Form.NFD);
}
public static String utf8(final String s) {
public String utf8(final String s) {
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
return new String(bytes, StandardCharsets.UTF_8);
}
public static String unicodeNormalization(final String s) {
public String unicodeNormalization(final String s) {
Matcher m = hexUnicodePattern.matcher(s);
StringBuffer buf = new StringBuffer(s.length());
@ -196,7 +195,7 @@ public class AbstractPaceFunctions {
return buf.toString();
}
protected static String filterStopWords(final String s, final Set<String> stopwords) {
protected String filterStopWords(final String s, final Set<String> stopwords) {
final StringTokenizer st = new StringTokenizer(s);
final StringBuilder sb = new StringBuilder();
while (st.hasMoreTokens()) {
@ -209,7 +208,7 @@ public class AbstractPaceFunctions {
return sb.toString().trim();
}
public static String filterAllStopWords(String s) {
public String filterAllStopWords(String s) {
s = filterStopWords(s, stopwords_en);
s = filterStopWords(s, stopwords_de);
@ -222,8 +221,7 @@ public class AbstractPaceFunctions {
return s;
}
protected static Collection<String> filterBlacklisted(final Collection<String> set,
final Set<String> ngramBlacklist) {
protected Collection<String> filterBlacklisted(final Collection<String> set, final Set<String> ngramBlacklist) {
final Set<String> newset = Sets.newLinkedHashSet();
for (final String s : set) {
if (!ngramBlacklist.contains(s)) {
@ -270,7 +268,7 @@ public class AbstractPaceFunctions {
return m;
}
public static String removeKeywords(String s, Set<String> keywords) {
public String removeKeywords(String s, Set<String> keywords) {
s = " " + s + " ";
for (String k : keywords) {
@ -280,39 +278,39 @@ public class AbstractPaceFunctions {
return s.trim();
}
public static double commonElementsPercentage(Set<String> s1, Set<String> s2) {
public double commonElementsPercentage(Set<String> s1, Set<String> s2) {
double longer = Math.max(s1.size(), s2.size());
return (double) s1.stream().filter(s2::contains).count() / longer;
}
// convert the set of keywords to codes
public static Set<String> toCodes(Set<String> keywords, Map<String, String> translationMap) {
public Set<String> toCodes(Set<String> keywords, Map<String, String> translationMap) {
return keywords.stream().map(s -> translationMap.get(s)).collect(Collectors.toSet());
}
public static Set<String> keywordsToCodes(Set<String> keywords, Map<String, String> translationMap) {
public Set<String> keywordsToCodes(Set<String> keywords, Map<String, String> translationMap) {
return toCodes(keywords, translationMap);
}
public static Set<String> citiesToCodes(Set<String> keywords) {
public Set<String> citiesToCodes(Set<String> keywords) {
return toCodes(keywords, cityMap);
}
protected static String firstLC(final String s) {
protected String firstLC(final String s) {
return StringUtils.substring(s, 0, 1).toLowerCase();
}
protected static Iterable<String> tokens(final String s, final int maxTokens) {
protected Iterable<String> tokens(final String s, final int maxTokens) {
return Iterables.limit(Splitter.on(" ").omitEmptyStrings().trimResults().split(s), maxTokens);
}
public static String normalizePid(String pid) {
public String normalizePid(String pid) {
return DOI_PREFIX.matcher(pid.toLowerCase()).replaceAll("");
}
// get the list of keywords into the input string
public static Set<String> getKeywords(String s1, Map<String, String> translationMap, int windowSize) {
public Set<String> getKeywords(String s1, Map<String, String> translationMap, int windowSize) {
String s = s1;
@ -342,7 +340,7 @@ public class AbstractPaceFunctions {
return codes;
}
public static Set<String> getCities(String s1, int windowSize) {
public Set<String> getCities(String s1, int windowSize) {
return getKeywords(s1, cityMap, windowSize);
}
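
The hunks above turn the AbstractPaceFunctions helpers, including the keyword and city handling (getKeywords, toCodes, getCities), from static into instance methods. As a rough, stand-alone illustration of what the keyword-to-code translation computes, the following plain Java sketch matches single-token keywords against a translation map; it is not the library code, it skips the windowSize handling of multi-word keywords, and the city codes are invented.

import java.util.*;
import java.util.stream.Collectors;

// Illustrative sketch only: single-token keywords found in a lower-cased string
// are translated into codes through a map, in the spirit of getKeywords/toCodes
// above. Multi-word keyword handling (windowSize) is omitted; codes are invented.
public class KeywordCodesSketch {

    static Set<String> toCodes(Set<String> keywords, Map<String, String> translationMap) {
        return keywords.stream().map(translationMap::get).collect(Collectors.toSet());
    }

    static Set<String> getKeywords(String s, Map<String, String> translationMap) {
        Set<String> found = new LinkedHashSet<>();
        for (String token : s.toLowerCase().split("\\s+")) {
            if (translationMap.containsKey(token)) {
                found.add(token);
            }
        }
        return toCodes(found, translationMap);
    }

    public static void main(String[] args) {
        Map<String, String> cityMap = new HashMap<>();
        cityMap.put("rome", "city::0001"); // hypothetical codes
        cityMap.put("pisa", "city::0002");

        // prints the two city codes (set order may vary)
        System.out.println(getKeywords("University of Rome and CNR Pisa", cityMap));
    }
}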

View File

@ -18,7 +18,7 @@ public class ClusteringDef implements Serializable {
private List<String> fields;
private Map<String, Object> params;
private Map<String, Integer> params;
public ClusteringDef() {
}
@ -43,11 +43,11 @@ public class ClusteringDef implements Serializable {
this.fields = fields;
}
public Map<String, Object> getParams() {
public Map<String, Integer> getParams() {
return params;
}
public void setParams(final Map<String, Object> params) {
public void setParams(final Map<String, Integer> params) {
this.params = params;
}

View File

@ -2,7 +2,6 @@
package eu.dnetlib.pace.model;
import java.io.Serializable;
import java.util.HashSet;
import java.util.List;
import com.fasterxml.jackson.core.JsonProcessingException;
@ -37,16 +36,6 @@ public class FieldDef implements Serializable {
*/
private int length = -1;
private HashSet<String> filter;
private boolean sorted;
public boolean isSorted() {
return sorted;
}
private String clean;
public FieldDef() {
}
@ -102,30 +91,6 @@ public class FieldDef implements Serializable {
this.path = path;
}
public HashSet<String> getFilter() {
return filter;
}
public void setFilter(HashSet<String> filter) {
this.filter = filter;
}
public boolean getSorted() {
return sorted;
}
public void setSorted(boolean sorted) {
this.sorted = sorted;
}
public String getClean() {
return clean;
}
public void setClean(String clean) {
this.clean = clean;
}
@Override
public String toString() {
try {

View File

@ -5,9 +5,9 @@ import eu.dnetlib.pace.util.{BlockProcessor, SparkReporter}
import org.apache.spark.SparkContext
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.{col, desc, expr, lit, udf}
import org.apache.spark.sql.functions.{col, lit, udf}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, Dataset, Row, SaveMode, functions}
import org.apache.spark.sql.{Column, Dataset, Row, functions}
import java.util.function.Predicate
import java.util.stream.Collectors
@ -80,8 +80,6 @@ case class SparkDeduper(conf: DedupConfig) extends Serializable {
.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(inputColumns: _*))))
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName))))
// .withColumn("count", functions.max("position").over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName)).rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing) ))
// .filter("count > 1")
if (df_with_clustering_keys == null)
df_with_clustering_keys = ds
@ -90,44 +88,20 @@ case class SparkDeduper(conf: DedupConfig) extends Serializable {
}
//TODO: analytics
/*df_with_clustering_keys.groupBy(col("clustering"), col("key"))
.agg(expr("max(count) AS size"))
.orderBy(desc("size"))
.show*/
val df_with_blocks = df_with_clustering_keys
// split the clustering block into smaller blocks of queuemaxsize
.groupBy(col("clustering"), col("key"), functions.floor(col("position").divide(lit(conf.getWf.getQueueMaxSize))))
// filter out rows with position exceeding the maxqueuesize parameter
.filter(col("position").leq(conf.getWf.getQueueMaxSize))
.groupBy("clustering", "key")
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
.filter(functions.size(new Column("block")).gt(1))
.union(
//adjacency blocks
df_with_clustering_keys
// filter out leading and trailing elements
.filter(col("position").gt(conf.getWf.getSlidingWindowSize/2))
//.filter(col("position").lt(col("count").minus(conf.getWf.getSlidingWindowSize/2)))
// create small blocks of records on "the border" of maxqueuesize: getSlidingWindowSize/2 elements before and after
.filter(
col("position").mod(conf.getWf.getQueueMaxSize).lt(conf.getWf.getSlidingWindowSize/2) // slice of the start of block
|| col("position").mod(conf.getWf.getQueueMaxSize).gt(conf.getWf.getQueueMaxSize - (conf.getWf.getSlidingWindowSize/2)) //slice of the end of the block
)
.groupBy(col("clustering"), col("key"), functions.floor((col("position") + lit(conf.getWf.getSlidingWindowSize/2)).divide(lit(conf.getWf.getQueueMaxSize))))
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
.filter(functions.size(new Column("block")).gt(1))
)
df_with_blocks
}
def clusterValuesUDF(cd: ClusteringDef) = {
udf[mutable.WrappedArray[String], mutable.WrappedArray[Any]](values => {
val valueList = values.flatMap {
case a: mutable.WrappedArray[Any] => a.map(_.toString)
case s: Any => Seq(s.toString)
}.asJava;
mutable.WrappedArray.make(cd.clusteringFunction().apply(conf, valueList).toArray())
values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala)
})
}
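
The Scala hunk above drops the commented-out analytics and the adjacency-blocks union, keeping the grouping of records by clustering key and by floor(position / queueMaxSize). Below is a minimal sketch of that grouping expression outside Spark, using plain Java collections and invented record ids; it makes no claim about the exact boundary handling of the real job.

import java.util.*;

// Hedged sketch of the blocking idea: records sharing a clustering key are
// ordered, given a 1-based position, and grouped into sub-blocks through
// integer division of the position by queueMaxSize.
public class BlockChunkingSketch {

    static Map<Long, List<String>> chunk(List<String> orderedIds, int queueMaxSize) {
        Map<Long, List<String>> blocks = new LinkedHashMap<>();
        for (int i = 0; i < orderedIds.size(); i++) {
            long position = i + 1L;                                   // row_number() is 1-based
            long blockId = (long) Math.floor((double) position / queueMaxSize);
            blocks.computeIfAbsent(blockId, k -> new ArrayList<>()).add(orderedIds.get(i));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        // with queueMaxSize = 2: {0=[r1], 1=[r2, r3], 2=[r4, r5]}
        System.out.println(chunk(ids, 2));
    }
}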

View File

@ -1,16 +1,13 @@
package eu.dnetlib.pace.model
import com.jayway.jsonpath.{Configuration, JsonPath}
import eu.dnetlib.pace.common.AbstractPaceFunctions
import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.util.MapDocumentUtil
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
import org.apache.spark.sql.{Dataset, Row}
import java.util.Locale
import java.util.regex.Pattern
import scala.collection.JavaConverters._
@ -63,7 +60,7 @@ case class SparkModel(conf: DedupConfig) {
values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
schema.fieldNames.zipWithIndex.foldLeft(values) {
case ((res, (fname, index))) =>
case ((res, (fname, index))) => {
val fdef = conf.getPace.getModelMap.get(fname)
if (fdef != null) {
@ -99,52 +96,13 @@ case class SparkModel(conf: DedupConfig) {
case Type.DoubleArray =>
MapDocumentUtil.getJPathArray(fdef.getPath, json)
}
val filter = fdef.getFilter
if (StringUtils.isNotBlank(fdef.getClean)) {
res(index) = res(index) match {
case x: Seq[String] => x.map(clean(_, fdef.getClean)).toSeq
case _ => clean(res(index).toString, fdef.getClean)
}
}
if (filter != null && !filter.isEmpty) {
res(index) = res(index) match {
case x: String if filter.contains(x.toLowerCase(Locale.ROOT)) => null
case x: Seq[String] => x.filter(s => !filter.contains(s.toLowerCase(Locale.ROOT))).toSeq
case _ => res(index)
}
}
if (fdef.getSorted) {
res(index) = res(index) match {
case x: Seq[String] => x.sorted.toSeq
case _ => res(index)
}
}
}
res
}
}
new GenericRowWithSchema(values, schema)
}
def clean(value: String, cleantype: String) : String = {
val res = cleantype match {
case "title" => AbstractPaceFunctions.cleanup(value)
case _ => value
}
// if (!res.equals(AbstractPaceFunctions.normalize(value))) {
// println(res)
// println(AbstractPaceFunctions.normalize(value))
// println()
// }
res
}
}

View File

@ -23,6 +23,7 @@ public class AuthorsMatch extends AbstractListComparator {
private String MODE; // full or surname
private int SIZE_THRESHOLD;
private String TYPE; // count or percentage
private int common;
public AuthorsMatch(Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler());
@ -34,6 +35,7 @@ public class AuthorsMatch extends AbstractListComparator {
FULLNAME_THRESHOLD = Double.parseDouble(params.getOrDefault("fullname_th", "0.9"));
SIZE_THRESHOLD = Integer.parseInt(params.getOrDefault("size_th", "20"));
TYPE = params.getOrDefault("type", "percentage");
common = 0;
}
protected AuthorsMatch(double w, AbstractStringDistance ssalgo) {
@ -42,27 +44,22 @@ public class AuthorsMatch extends AbstractListComparator {
@Override
public double compare(final List<String> a, final List<String> b, final Config conf) {
if (a.isEmpty() || b.isEmpty())
return -1;
if (a.size() > SIZE_THRESHOLD || b.size() > SIZE_THRESHOLD)
return 1.0;
int maxMiss = Integer.MAX_VALUE;
List<Person> aList = a.stream().map(author -> new Person(author, false)).collect(Collectors.toList());
List<Person> bList = b.stream().map(author -> new Person(author, false)).collect(Collectors.toList());
Double threshold = getDoubleParam("threshold");
if (threshold != null && threshold >= 0.0 && threshold <= 1.0 && a.size() == b.size()) {
maxMiss = (int) Math.floor((1 - threshold) * Math.max(a.size(), b.size()));
}
int common = 0;
common = 0;
// compare each element of List1 with each element of List2
for (int i = 0; i < a.size(); i++) {
Person p1 = new Person(a.get(i), false);
for (Person p1 : aList)
for (Person p2 : bList) {
// both persons are inaccurate
if (!p1.isAccurate() && !p2.isAccurate()) {
// compare just normalized fullnames
@ -121,15 +118,11 @@ public class AuthorsMatch extends AbstractListComparator {
}
}
}
if (i - common > maxMiss) {
return 0.0;
}
}
// normalization factor to compute the score
int normFactor = a.size() == b.size() ? a.size() : (a.size() + b.size() - common);
int normFactor = aList.size() == bList.size() ? aList.size() : (aList.size() + bList.size() - common);
if (TYPE.equals("percentage")) {
return (double) common / normFactor;
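
The normalization factor used right above is easier to read with a small worked example: with 3 authors on one side, 4 on the other and 2 matched pairs (all counts invented), the factor is 3 + 4 - 2 = 5 and the "percentage" score is 2/5 = 0.4.

// Worked example of the score normalization above; the counts are invented.
public class AuthorsMatchScoreExample {
    public static void main(String[] args) {
        int sizeA = 3, sizeB = 4, common = 2;
        int normFactor = sizeA == sizeB ? sizeA : (sizeA + sizeB - common);
        System.out.println((double) common / normFactor); // 0.4
    }
}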

View File

@ -25,7 +25,6 @@ public class InstanceTypeMatch extends AbstractListComparator {
translationMap.put("Conference object", "*");
translationMap.put("Other literature type", "*");
translationMap.put("Unknown", "*");
translationMap.put("UNKNOWN", "*");
// article types
translationMap.put("Article", "Article");
@ -77,4 +76,5 @@ public class InstanceTypeMatch extends AbstractListComparator {
protected double normalize(final double d) {
return d;
}
}

View File

@ -3,7 +3,6 @@ package eu.dnetlib.pace.tree;
import java.util.Map;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
@ -31,25 +30,16 @@ public class LevensteinTitle extends AbstractStringComparator {
}
@Override
public double distance(final String ca, final String cb, final Config conf) {
public double distance(final String a, final String b, final Config conf) {
final String ca = cleanup(a);
final String cb = cleanup(b);
final boolean check = checkNumbers(ca, cb);
if (check)
return 0.5;
Double threshold = getDoubleParam("threshold");
// reduce Levenshtein algo complexity when target threshold is known
if (threshold != null && threshold >= 0.0 && threshold <= 1.0) {
int maxdistance = (int) Math.floor((1 - threshold) * Math.max(ca.length(), cb.length()));
int score = StringUtils.getLevenshteinDistance(ca, cb, maxdistance);
if (score == -1) {
return 0;
}
return normalize(score, ca.length(), cb.length());
} else {
return normalize(StringUtils.getLevenshteinDistance(ca, cb), ca.length(), cb.length());
}
return normalize(ssalgo.score(ca, cb), ca.length(), cb.length());
}
private double normalize(final double score, final int la, final int lb) {
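
The left-hand side of the hunk above bounds the Levenshtein computation when a target threshold is known: any distance larger than floor((1 - threshold) * max(length)) cannot reach the threshold, so the bounded commons-lang3 call may stop early and return -1. A small sketch of that idea follows, with an assumed threshold of 0.95 and invented titles.

import org.apache.commons.lang3.StringUtils;

// Sketch of the threshold-bounded Levenshtein distance; the threshold and the
// titles are assumptions for the example, not values taken from a configuration.
public class BoundedLevenshteinExample {
    public static void main(String[] args) {
        String ca = "a comparative study of deduplication strategies";
        String cb = "a comparative analysis of deduplication strategies";
        double threshold = 0.95;
        int maxDistance = (int) Math.floor((1 - threshold) * Math.max(ca.length(), cb.length()));
        // returns -1 as soon as the distance is known to exceed maxDistance
        int score = StringUtils.getLevenshteinDistance(ca, cb, maxDistance);
        System.out.println(score == -1 ? "cannot reach the threshold" : "distance = " + score);
    }
}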

View File

@ -1,29 +0,0 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("maxLengthMatch")
public class MaxLengthMatch extends AbstractStringComparator {
private final int limit;
public MaxLengthMatch(Map<String, String> params) {
super(params);
limit = Integer.parseInt(params.getOrDefault("limit", "200"));
}
@Override
public double compare(String a, String b, final Config conf) {
return a.length() < limit && b.length() < limit ? 1.0 : -1.0;
}
protected String toString(final Object object) {
return toFirstString(object);
}
}

View File

@ -127,14 +127,4 @@ public abstract class AbstractComparator<T> extends AbstractPaceFunctions implem
return this.weight;
}
public Double getDoubleParam(String name) {
String svalue = params.get(name);
try {
return Double.parseDouble(svalue);
} catch (Throwable t) {
}
return null;
}
}

View File

@ -67,10 +67,8 @@ public class BlockProcessor {
private void processRows(final List<Row> queue, final Reporter context) {
IncrementalConnectedComponents icc = new IncrementalConnectedComponents(queue.size());
for (int i = 0; i < queue.size(); i++) {
final Row pivot = queue.get(i);
for (int pivotPos = 0; pivotPos < queue.size(); pivotPos++) {
final Row pivot = queue.get(pivotPos);
final String idPivot = pivot.getString(identifierFieldPos); // identifier
final Object fieldsPivot = getJavaValue(pivot, orderFieldPos);
@ -78,9 +76,9 @@ public class BlockProcessor {
final WfConfig wf = dedupConf.getWf();
if (fieldPivot != null) {
for (int j = icc.nextUnconnected(i, i + 1); j >= 0
&& j < queue.size(); j = icc.nextUnconnected(i, j + 1)) {
final Row curr = queue.get(j);
int i = 0;
for (int windowPos = pivotPos + 1; windowPos < queue.size(); windowPos++) {
final Row curr = queue.get(windowPos);
final String idCurr = curr.getString(identifierFieldPos); // identifier
if (mustSkip(idCurr)) {
@ -88,7 +86,7 @@ public class BlockProcessor {
break;
}
if (wf.getSlidingWindowSize() > 0 && (j - i) > wf.getSlidingWindowSize()) {
if (++i > wf.getSlidingWindowSize()) {
break;
}
@ -99,9 +97,7 @@ public class BlockProcessor {
final TreeProcessor treeProcessor = new TreeProcessor(dedupConf);
if (emitOutput(treeProcessor.compare(pivot, curr), idPivot, idCurr, context)) {
icc.connect(i, j);
}
emitOutput(treeProcessor.compare(pivot, curr), idPivot, idCurr, context);
}
}
}
@ -119,8 +115,7 @@ public class BlockProcessor {
return null;
}
private boolean emitOutput(final boolean result, final String idPivot, final String idCurr,
final Reporter context) {
private void emitOutput(final boolean result, final String idPivot, final String idCurr, final Reporter context) {
if (result) {
if (idPivot.compareTo(idCurr) <= 0) {
@ -132,8 +127,6 @@ public class BlockProcessor {
} else {
context.incrementCounter(dedupConf.getWf().getEntityType(), "d < " + dedupConf.getWf().getThreshold(), 1);
}
return result;
}
private boolean mustSkip(final String idPivot) {
@ -149,4 +142,5 @@ public class BlockProcessor {
context.emit(type, from, to);
}
}
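
The hunks above replace the connected-components driven iteration with a plain sliding window: each row is taken as pivot and compared only with the following slidingWindowSize rows of the ordered queue. The sketch below mirrors that loop shape, with invented ids and an always-true comparator standing in for the TreeProcessor.

import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

// Hedged sketch of the restored loop structure: every element is a pivot and is
// compared only with the next slidingWindowSize elements of the ordered queue.
public class SlidingWindowSketch {

    static void processRows(List<String> queue, int slidingWindowSize, BiPredicate<String, String> compare) {
        for (int pivotPos = 0; pivotPos < queue.size(); pivotPos++) {
            String pivot = queue.get(pivotPos);
            int i = 0;
            for (int windowPos = pivotPos + 1; windowPos < queue.size(); windowPos++) {
                if (++i > slidingWindowSize) {
                    break;                                            // stay inside the window
                }
                String curr = queue.get(windowPos);
                if (compare.test(pivot, curr)) {
                    System.out.println(pivot + " <-> " + curr);       // a relation would be emitted here
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("50|a", "50|b", "50|c", "50|d");
        processRows(ids, 2, (a, b) -> true); // always "similar", just to show the pairs visited
    }
}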

View File

@ -1,50 +0,0 @@
package eu.dnetlib.pace.util;
import java.util.BitSet;
public class IncrementalConnectedComponents {
final private int size;
final private BitSet[] indexes;
IncrementalConnectedComponents(int size) {
this.size = size;
this.indexes = new BitSet[size];
}
public void connect(int i, int j) {
if (indexes[i] == null) {
if (indexes[j] == null) {
indexes[i] = new BitSet(size);
} else {
indexes[i] = indexes[j];
}
} else {
if (indexes[j] != null && indexes[i] != indexes[j]) {
// merge adjacency lists for i and j
indexes[i].or(indexes[j]);
}
}
indexes[i].set(i);
indexes[i].set(j);
indexes[j] = indexes[i];
}
public int nextUnconnected(int i, int j) {
if (indexes[i] == null) {
return j;
}
int result = indexes[i].nextClearBit(j);
return (result >= size) ? -1 : result;
}
public BitSet getConnections(int i) {
if (indexes[i] == null) {
return null;
}
return indexes[i];
}
}

View File

@ -97,8 +97,6 @@ public class MapDocumentUtil {
Object o = json.read(jsonPath);
if (o instanceof String)
return (String) o;
if (o instanceof Number)
return (String) o.toString();
if (o instanceof JSONArray && ((JSONArray) o).size() > 0)
return (String) ((JSONArray) o).get(0);
return "";

View File

@ -40,7 +40,7 @@ public class PaceResolver implements Serializable {
Collectors.toMap(cl -> cl.getAnnotation(ComparatorClass.class).value(), cl -> (Class<Comparator>) cl));
}
public ClusteringFunction getClusteringFunction(String name, Map<String, Object> params) throws PaceException {
public ClusteringFunction getClusteringFunction(String name, Map<String, Integer> params) throws PaceException {
try {
return clusteringFunctions.get(name).getDeclaredConstructor(Map.class).newInstance(params);
} catch (InstantiationException | IllegalAccessException | InvocationTargetException

View File

@ -15,7 +15,7 @@ import eu.dnetlib.pace.config.DedupConfig;
public class ClusteringFunctionTest extends AbstractPaceTest {
private static Map<String, Object> params;
private static Map<String, Integer> params;
private static DedupConfig conf;
@BeforeAll
@ -40,10 +40,10 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testNgram() {
params.put("ngramLen", "3");
params.put("max", "8");
params.put("maxPerToken", "2");
params.put("minNgramLen", "1");
params.put("ngramLen", 3);
params.put("max", 8);
params.put("maxPerToken", 2);
params.put("minNgramLen", 1);
final ClusteringFunction ngram = new Ngrams(params);
@ -54,8 +54,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testNgramPairs() {
params.put("ngramLen", "3");
params.put("max", "2");
params.put("ngramLen", 3);
params.put("max", 2);
final ClusteringFunction np = new NgramPairs(params);
@ -66,8 +66,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testSortedNgramPairs() {
params.put("ngramLen", "3");
params.put("max", "2");
params.put("ngramLen", 3);
params.put("max", 2);
final ClusteringFunction np = new SortedNgramPairs(params);
@ -87,9 +87,9 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testAcronym() {
params.put("max", "4");
params.put("minLen", "1");
params.put("maxLen", "3");
params.put("max", 4);
params.put("minLen", 1);
params.put("maxLen", 3);
final ClusteringFunction acro = new Acronyms(params);
@ -100,8 +100,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testSuffixPrefix() {
params.put("len", "3");
params.put("max", "4");
params.put("len", 3);
params.put("max", 4);
final ClusteringFunction sp = new SuffixPrefix(params);
@ -109,8 +109,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
System.out.println(s);
System.out.println(sp.apply(conf, Lists.newArrayList(s)));
params.put("len", "3");
params.put("max", "1");
params.put("len", 3);
params.put("max", 1);
System.out.println(sp.apply(conf, Lists.newArrayList("Framework for general-purpose deduplication")));
}
@ -118,8 +118,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testWordsSuffixPrefix() {
params.put("len", "3");
params.put("max", "4");
params.put("len", 3);
params.put("max", 4);
final ClusteringFunction sp = new WordsSuffixPrefix(params);
@ -130,7 +130,7 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testWordsStatsSuffixPrefix() {
params.put("mod", "10");
params.put("mod", 10);
final ClusteringFunction sp = new WordsStatsSuffixPrefixChain(params);
@ -167,7 +167,7 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testFieldValue() {
params.put("randomLength", "5");
params.put("randomLength", 5);
final ClusteringFunction sp = new SpaceTrimmingFieldValue(params);

View File

@ -1,40 +0,0 @@
package eu.dnetlib.pace.util;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNull;
import org.junit.jupiter.api.Test;
public class IncrementalConnectedComponentsTest {
@Test
public void transitiveClosureTest() {
IncrementalConnectedComponents icc = new IncrementalConnectedComponents(10);
icc.connect(0, 1);
icc.connect(0, 2);
icc.connect(0, 3);
icc.connect(1, 2);
icc.connect(1, 4);
icc.connect(1, 5);
icc.connect(6, 7);
icc.connect(6, 9);
assertEquals(icc.getConnections(0).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(1).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(2).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(3).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(4).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(5).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(6).toString(), "{6, 7, 9}");
assertEquals(icc.getConnections(7).toString(), "{6, 7, 9}");
assertEquals(icc.getConnections(9).toString(), "{6, 7, 9}");
assertNull(icc.getConnections(8));
}
}

View File

@ -1,39 +0,0 @@
/*
* Copyright (c) 2024.
* SPDX-FileCopyrightText: © 2023 Consiglio Nazionale delle Ricerche
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
package eu.dnetlib.dhp.actionmanager.promote;
/** Encodes the Actionset promotion strategies */
public class PromoteAction {
/** The supported actionset promotion strategies
*
* ENRICH: promotes only records in the actionset matching another record in the
* graph and enriches them applying the given MergeAndGet strategy
* UPSERT: promotes all the records in an actionset, matching records are updated
using the given MergeAndGet strategy, the non-matching records are inserted as they are.
*/
public enum Strategy {
ENRICH, UPSERT
}
/**
* Returns the string representation of the join type implementing the given PromoteAction.
*
* @param strategy the strategy to be used to promote the Actionset contents
* @return the join type used to implement the promotion strategy
*/
public static String joinTypeForStrategy(PromoteAction.Strategy strategy) {
switch (strategy) {
case ENRICH:
return "left_outer";
case UPSERT:
return "full_outer";
default:
throw new IllegalStateException("unsupported PromoteAction: " + strategy.toString());
}
}
}
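
The removed class above documents the mapping between promotion strategies and join types: ENRICH keeps only graph records (left_outer, enriching the ones with a matching payload), while UPSERT also inserts unmatched payloads (full_outer). The following id-level illustration is not the Spark join itself and uses invented ids; it only models which identifiers survive.

import java.util.*;

// Id-level illustration of the two strategies; ids are invented and the method
// only models which identifiers survive the join, not the record merging.
public class JoinTypeForStrategyExample {

    static Set<String> promotedIds(Set<String> graphIds, Set<String> payloadIds, String joinType) {
        Set<String> result = new TreeSet<>(graphIds); // the graph side is always kept
        if ("full_outer".equals(joinType)) {
            result.addAll(payloadIds);                // UPSERT: unmatched payloads are inserted too
        }
        // with "left_outer" (ENRICH), payloads only contribute where they match a graph id
        return result;
    }

    public static void main(String[] args) {
        Set<String> graph = new HashSet<>(Arrays.asList("g1", "g2"));
        Set<String> payload = new HashSet<>(Arrays.asList("g2", "g3"));
        System.out.println("ENRICH -> " + promotedIds(graph, payload, "left_outer")); // [g1, g2]
        System.out.println("UPSERT -> " + promotedIds(graph, payload, "full_outer")); // [g1, g2, g3]
    }
}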

View File

@ -67,9 +67,8 @@ public class PromoteActionPayloadForGraphTableJob {
String outputGraphTablePath = parser.get("outputGraphTablePath");
logger.info("outputGraphTablePath: {}", outputGraphTablePath);
MergeAndGet.Strategy mergeAndGetStrategy = MergeAndGet.Strategy
.valueOf(parser.get("mergeAndGetStrategy").toUpperCase());
logger.info("mergeAndGetStrategy: {}", mergeAndGetStrategy);
MergeAndGet.Strategy strategy = MergeAndGet.Strategy.valueOf(parser.get("mergeAndGetStrategy").toUpperCase());
logger.info("strategy: {}", strategy);
Boolean shouldGroupById = Optional
.ofNullable(parser.get("shouldGroupById"))
@ -77,12 +76,6 @@ public class PromoteActionPayloadForGraphTableJob {
.orElse(true);
logger.info("shouldGroupById: {}", shouldGroupById);
PromoteAction.Strategy promoteActionStrategy = Optional
.ofNullable(parser.get("promoteActionStrategy"))
.map(PromoteAction.Strategy::valueOf)
.orElse(PromoteAction.Strategy.UPSERT);
logger.info("promoteActionStrategy: {}", promoteActionStrategy);
@SuppressWarnings("unchecked")
Class<? extends Oaf> rowClazz = (Class<? extends Oaf>) Class.forName(graphTableClassName);
@SuppressWarnings("unchecked")
@ -104,8 +97,7 @@ public class PromoteActionPayloadForGraphTableJob {
inputGraphTablePath,
inputActionPayloadPath,
outputGraphTablePath,
mergeAndGetStrategy,
promoteActionStrategy,
strategy,
rowClazz,
actionPayloadClazz,
shouldGroupById);
@ -132,16 +124,14 @@ public class PromoteActionPayloadForGraphTableJob {
String inputGraphTablePath,
String inputActionPayloadPath,
String outputGraphTablePath,
MergeAndGet.Strategy mergeAndGetStrategy,
PromoteAction.Strategy promoteActionStrategy,
MergeAndGet.Strategy strategy,
Class<G> rowClazz,
Class<A> actionPayloadClazz, Boolean shouldGroupById) {
Dataset<G> rowDS = readGraphTable(spark, inputGraphTablePath, rowClazz);
Dataset<A> actionPayloadDS = readActionPayload(spark, inputActionPayloadPath, actionPayloadClazz);
Dataset<G> result = promoteActionPayloadForGraphTable(
rowDS, actionPayloadDS, mergeAndGetStrategy, promoteActionStrategy, rowClazz, actionPayloadClazz,
shouldGroupById)
rowDS, actionPayloadDS, strategy, rowClazz, actionPayloadClazz, shouldGroupById)
.map((MapFunction<G, G>) value -> value, Encoders.bean(rowClazz));
saveGraphTable(result, outputGraphTablePath);
@ -193,8 +183,7 @@ public class PromoteActionPayloadForGraphTableJob {
private static <G extends Oaf, A extends Oaf> Dataset<G> promoteActionPayloadForGraphTable(
Dataset<G> rowDS,
Dataset<A> actionPayloadDS,
MergeAndGet.Strategy mergeAndGetStrategy,
PromoteAction.Strategy promoteActionStrategy,
MergeAndGet.Strategy strategy,
Class<G> rowClazz,
Class<A> actionPayloadClazz,
Boolean shouldGroupById) {
@ -206,9 +195,8 @@ public class PromoteActionPayloadForGraphTableJob {
SerializableSupplier<Function<G, String>> rowIdFn = ModelSupport::idFn;
SerializableSupplier<Function<A, String>> actionPayloadIdFn = ModelSupport::idFn;
SerializableSupplier<BiFunction<G, A, G>> mergeRowWithActionPayloadAndGetFn = MergeAndGet
.functionFor(mergeAndGetStrategy);
SerializableSupplier<BiFunction<G, G, G>> mergeRowsAndGetFn = MergeAndGet.functionFor(mergeAndGetStrategy);
SerializableSupplier<BiFunction<G, A, G>> mergeRowWithActionPayloadAndGetFn = MergeAndGet.functionFor(strategy);
SerializableSupplier<BiFunction<G, G, G>> mergeRowsAndGetFn = MergeAndGet.functionFor(strategy);
SerializableSupplier<G> zeroFn = zeroFn(rowClazz);
SerializableSupplier<Function<G, Boolean>> isNotZeroFn = PromoteActionPayloadForGraphTableJob::isNotZeroFnUsingIdOrSourceAndTarget;
@ -219,7 +207,6 @@ public class PromoteActionPayloadForGraphTableJob {
rowIdFn,
actionPayloadIdFn,
mergeRowWithActionPayloadAndGetFn,
promoteActionStrategy,
rowClazz,
actionPayloadClazz);

View File

@ -34,7 +34,6 @@ public class PromoteActionPayloadFunctions {
* @param rowIdFn Function used to get the id of graph table row
* @param actionPayloadIdFn Function used to get id of action payload instance
* @param mergeAndGetFn Function used to merge graph table row and action payload instance
* @param promoteActionStrategy the Actionset promotion strategy
* @param rowClazz Class of graph table
* @param actionPayloadClazz Class of action payload
* @param <G> Type of graph table row
@ -47,7 +46,6 @@ public class PromoteActionPayloadFunctions {
SerializableSupplier<Function<G, String>> rowIdFn,
SerializableSupplier<Function<A, String>> actionPayloadIdFn,
SerializableSupplier<BiFunction<G, A, G>> mergeAndGetFn,
PromoteAction.Strategy promoteActionStrategy,
Class<G> rowClazz,
Class<A> actionPayloadClazz) {
if (!isSubClass(rowClazz, actionPayloadClazz)) {
@ -63,7 +61,7 @@ public class PromoteActionPayloadFunctions {
.joinWith(
actionPayloadWithIdDS,
rowWithIdDS.col("_1").equalTo(actionPayloadWithIdDS.col("_1")),
PromoteAction.joinTypeForStrategy(promoteActionStrategy))
"full_outer")
.map(
(MapFunction<Tuple2<Tuple2<String, G>, Tuple2<String, A>>, G>) value -> {
Optional<G> rowOpt = Optional.ofNullable(value._1()).map(Tuple2::_2);
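
The hunk above hard-wires the join type to full_outer and then merges each matching (row, payload) pair. Below is a minimal, non-Spark sketch of that join-and-merge step, with an invented record encoding and a merge rule that simply prefers the payload.

import java.util.*;
import java.util.function.BiFunction;
import java.util.function.Function;

// Minimal sketch: rows and payloads are keyed by their ids, matching pairs are
// merged with mergeAndGetFn, and (as in the full_outer case) unmatched payloads
// pass through unchanged. The record type and merge rule are invented.
public class JoinAndMergeSketch {

    static <G> List<G> joinAndMerge(List<G> rows, List<G> payloads,
            Function<G, String> idFn, BiFunction<G, G, G> mergeAndGetFn) {
        Map<String, G> rowsById = new LinkedHashMap<>();
        rows.forEach(r -> rowsById.put(idFn.apply(r), r));

        List<G> result = new ArrayList<>();
        Set<String> matched = new HashSet<>();
        for (G payload : payloads) {
            G row = rowsById.get(idFn.apply(payload));
            if (row != null) {
                matched.add(idFn.apply(payload));
                result.add(mergeAndGetFn.apply(row, payload)); // matching pair: merge
            } else {
                result.add(payload);                           // unmatched payload: insert as-is
            }
        }
        rowsById.forEach((id, row) -> {
            if (!matched.contains(id)) {
                result.add(row);                               // rows without a payload survive
            }
        });
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("id1|old", "id2|old");
        List<String> payloads = Arrays.asList("id2|new", "id3|new");
        Function<String, String> idFn = s -> s.split("\\|")[0];
        // merge rule for the example: prefer the payload value
        System.out.println(joinAndMerge(rows, payloads, idFn, (row, payload) -> payload));
        // -> [id2|new, id3|new, id1|old]
    }
}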

View File

@ -41,12 +41,6 @@
"paramDescription": "strategy for merging graph table objects with action payload instances, MERGE_FROM_AND_GET or SELECT_NEWER_AND_GET",
"paramRequired": true
},
{
"paramName": "pas",
"paramLongName": "promoteActionStrategy",
"paramDescription": "strategy for promoting the actionset contents into the graph tables, ENRICH or UPSERT (default)",
"paramRequired": false
},
{
"paramName": "sgid",
"paramLongName": "shouldGroupById",

View File

@ -115,7 +115,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Dataset</arg>
<arg>--outputGraphTablePath</arg><arg>${workingDir}/dataset</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="DecisionPromoteResultActionPayloadForDatasetTable"/>
@ -168,7 +167,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/dataset</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="End"/>

View File

@ -106,7 +106,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Datasource</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/datasource</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>

View File

@ -106,7 +106,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Organization</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/organization</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>

View File

@ -114,7 +114,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.OtherResearchProduct</arg>
<arg>--outputGraphTablePath</arg><arg>${workingDir}/otherresearchproduct</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="DecisionPromoteResultActionPayloadForOtherResearchProductTable"/>
@ -167,7 +166,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/otherresearchproduct</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="End"/>

View File

@ -106,7 +106,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Project</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/project</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>

View File

@ -115,7 +115,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>
<arg>--outputGraphTablePath</arg><arg>${workingDir}/publication</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="DecisionPromoteResultActionPayloadForPublicationTable"/>
@ -168,7 +167,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/publication</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="End"/>

View File

@ -107,7 +107,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Relation</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/relation</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>

View File

@ -114,7 +114,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Software</arg>
<arg>--outputGraphTablePath</arg><arg>${workingDir}/software</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="DecisionPromoteResultActionPayloadForSoftwareTable"/>
@ -167,7 +166,6 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/software</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="End"/>

View File

@ -54,7 +54,7 @@ public class PromoteActionPayloadFunctionsTest {
RuntimeException.class,
() -> PromoteActionPayloadFunctions
.joinGraphTableWithActionPayloadAndMerge(
null, null, null, null, null, null, OafImplSubSub.class, OafImpl.class));
null, null, null, null, null, OafImplSubSub.class, OafImpl.class));
}
@Test
@ -104,7 +104,6 @@ public class PromoteActionPayloadFunctionsTest {
rowIdFn,
actionPayloadIdFn,
mergeAndGetFn,
PromoteAction.Strategy.UPSERT,
OafImplSubSub.class,
OafImplSubSub.class)
.collectAsList();
@ -184,7 +183,6 @@ public class PromoteActionPayloadFunctionsTest {
rowIdFn,
actionPayloadIdFn,
mergeAndGetFn,
PromoteAction.Strategy.UPSERT,
OafImplSubSub.class,
OafImplSub.class)
.collectAsList();

View File

@ -124,19 +124,8 @@ public class PrepareFOSSparkJob implements Serializable {
FOSDataModel first) {
level1.add(first.getLevel1());
level2.add(first.getLevel2());
if (Optional.ofNullable(first.getLevel3()).isPresent() &&
!first.getLevel3().equalsIgnoreCase(NA) && !first.getLevel3().equalsIgnoreCase(NULL)
&& first.getLevel3() != null)
level3.add(first.getLevel3() + "@@" + first.getScoreL3());
else
level3.add(NULL);
if (Optional.ofNullable(first.getLevel4()).isPresent() &&
!first.getLevel4().equalsIgnoreCase(NA) &&
!first.getLevel4().equalsIgnoreCase(NULL) &&
first.getLevel4() != null)
level4.add(first.getLevel4() + "@@" + first.getScoreL4());
else
level4.add(NULL);
level3.add(first.getLevel3() + "@@" + first.getScoreL3());
level4.add(first.getLevel4() + "@@" + first.getScoreL4());
}
}

View File

@ -19,7 +19,6 @@ import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.aggregation.common.ReporterCallback;
import eu.dnetlib.dhp.aggregation.common.ReportingJob;
import eu.dnetlib.dhp.collection.plugin.CollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.base.BaseCollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.file.FileCollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.file.FileGZipCollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.mongodb.MDStoreCollectorPlugin;
@ -121,8 +120,6 @@ public class CollectorWorker extends ReportingJob {
return new FileCollectorPlugin(fileSystem);
case fileGzip:
return new FileGZipCollectorPlugin(fileSystem);
case baseDump:
return new BaseCollectorPlugin(this.fileSystem);
case other:
final CollectorPlugin.NAME.OTHER_NAME plugin = Optional
.ofNullable(api.getParams().get("other_plugin_type"))

View File

@ -1,102 +0,0 @@
package eu.dnetlib.dhp.collection.orcid;
import static eu.dnetlib.dhp.utils.DHPUtils.getHadoopConfiguration;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.Objects;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
public class DownloadORCIDDumpApplication {
private static final Logger log = LoggerFactory.getLogger(DownloadORCIDDumpApplication.class);
private final FileSystem fileSystem;
public DownloadORCIDDumpApplication(FileSystem fileSystem) {
this.fileSystem = fileSystem;
}
public static void main(String[] args) throws Exception {
final ArgumentApplicationParser argumentParser = new ArgumentApplicationParser(
IOUtils
.toString(
Objects
.requireNonNull(
DownloadORCIDDumpApplication.class
.getResourceAsStream(
"/eu/dnetlib/dhp/collection/orcid/download_orcid_parameter.json"))));
argumentParser.parseArgument(args);
final String hdfsuri = argumentParser.get("namenode");
log.info("hdfsURI is {}", hdfsuri);
final String targetPath = argumentParser.get("targetPath");
log.info("targetPath is {}", targetPath);
final String apiURL = argumentParser.get("apiURL");
log.info("apiURL is {}", apiURL);
final FileSystem fileSystem = FileSystem.get(getHadoopConfiguration(hdfsuri));
new DownloadORCIDDumpApplication(fileSystem).run(targetPath, apiURL);
}
private void downloadItem(final String name, final String itemURL, final String basePath) {
try {
final Path hdfsWritePath = new Path(String.format("%s/%s", basePath, name));
final FSDataOutputStream fsDataOutputStream = fileSystem.create(hdfsWritePath, true);
final HttpGet request = new HttpGet(itemURL);
final int timeout = 60; // seconds
final RequestConfig config = RequestConfig
.custom()
.setConnectTimeout(timeout * 1000)
.setConnectionRequestTimeout(timeout * 1000)
.setSocketTimeout(timeout * 1000)
.build();
log.info("Downloading url {} into {}", itemURL, hdfsWritePath.getName());
try (CloseableHttpClient client = HttpClientBuilder.create().setDefaultRequestConfig(config).build();
CloseableHttpResponse response = client.execute(request)) {
int responseCode = response.getStatusLine().getStatusCode();
log.info("Response code is {}", responseCode);
if (responseCode >= 200 && responseCode < 400) {
IOUtils.copy(response.getEntity().getContent(), fsDataOutputStream);
}
} catch (Throwable eu) {
throw new RuntimeException(eu);
}
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
protected void run(final String targetPath, final String apiURL) throws Exception {
final ObjectMapper mapper = new ObjectMapper();
final URL url = new URL(apiURL);
URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();
final String json = IOUtils.toString(is);
JsonNode jsonNode = mapper.readTree(json);
jsonNode
.get("files")
.forEach(i -> downloadItem(i.get("name").asText(), i.get("download_url").asText(), targetPath));
}
}

View File

@ -1,71 +0,0 @@
package eu.dnetlib.dhp.collection.orcid;
import static eu.dnetlib.dhp.utils.DHPUtils.getHadoopConfiguration;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
public class ExtractORCIDDump {
private static final Logger log = LoggerFactory.getLogger(ExtractORCIDDump.class);
private final FileSystem fileSystem;
public ExtractORCIDDump(FileSystem fileSystem) {
this.fileSystem = fileSystem;
}
public static void main(String[] args) throws Exception {
final ArgumentApplicationParser argumentParser = new ArgumentApplicationParser(
IOUtils
.toString(
Objects
.requireNonNull(
DownloadORCIDDumpApplication.class
.getResourceAsStream(
"/eu/dnetlib/dhp/collection/orcid/extract_orcid_parameter.json"))));
argumentParser.parseArgument(args);
final String hdfsuri = argumentParser.get("namenode");
log.info("hdfsURI is {}", hdfsuri);
final String sourcePath = argumentParser.get("sourcePath");
log.info("sourcePath is {}", sourcePath);
final String targetPath = argumentParser.get("targetPath");
log.info("targetPath is {}", targetPath);
final FileSystem fileSystem = FileSystem.get(getHadoopConfiguration(hdfsuri));
new ExtractORCIDDump(fileSystem).run(sourcePath, targetPath);
}
public void run(final String sourcePath, final String targetPath) throws IOException, InterruptedException {
RemoteIterator<LocatedFileStatus> ls = fileSystem.listFiles(new Path(sourcePath), false);
final List<ORCIDExtractor> workers = new ArrayList<>();
int i = 0;
while (ls.hasNext()) {
LocatedFileStatus current = ls.next();
if (current.getPath().getName().endsWith("tar.gz")) {
workers.add(new ORCIDExtractor(fileSystem, "" + i++, current.getPath(), targetPath));
}
}
workers.forEach(Thread::start);
for (ORCIDExtractor worker : workers) {
worker.join();
}
}
}

View File

@ -1,171 +0,0 @@
package eu.dnetlib.dhp.collection.orcid;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* The ORCIDExtractor class extracts ORCID data from a TAR archive.
* The class creates a map of SequenceFile.Writer objects, one for each type of data that is to be extracted (e.g., employments, works, summaries).
* Then, it iterates over the TAR archive and writes each entry to the appropriate SequenceFile.Writer object.
* Finally, it closes all the SequenceFile.Writer objects.
*/
public class ORCIDExtractor extends Thread {
private static final Logger log = LoggerFactory.getLogger(ORCIDExtractor.class);
private final FileSystem fileSystem;
private final String id;
private final Path sourcePath;
private final String baseOutputPath;
public ORCIDExtractor(FileSystem fileSystem, String id, Path sourcePath, String baseOutputPath) {
this.fileSystem = fileSystem;
this.id = id;
this.sourcePath = sourcePath;
this.baseOutputPath = baseOutputPath;
}
/**
* creates a map of SequenceFile.Writer objects,
* one for each type of data that is to be extracted. The map is created based on the filename in the TAR archive.
* For example, if the filename is employments.json, the map will contain an entry for the SequenceFile.Writer
* object that writes employment data.
* @return the Map
*/
private Map<String, SequenceFile.Writer> createMap() {
try {
log.info("Thread {} Creating sequence files starting from this input Path {}", id, sourcePath.getName());
Map<String, SequenceFile.Writer> res = new HashMap<>();
if (sourcePath.getName().contains("summaries")) {
final String summaryPath = String.format("%s/summaries_%s", baseOutputPath, id);
final SequenceFile.Writer summary_file = SequenceFile
.createWriter(
fileSystem.getConf(),
SequenceFile.Writer.file(new Path(summaryPath)),
SequenceFile.Writer.keyClass(Text.class),
SequenceFile.Writer.valueClass(Text.class));
log.info("Thread {} Creating only summary path here {}", id, summaryPath);
res.put("summary", summary_file);
return res;
} else {
String employmentsPath = String.format("%s/employments_%s", baseOutputPath, id);
final SequenceFile.Writer employments_file = SequenceFile
.createWriter(
fileSystem.getConf(),
SequenceFile.Writer.file(new Path(employmentsPath)),
SequenceFile.Writer.keyClass(Text.class),
SequenceFile.Writer.valueClass(Text.class));
res.put("employments", employments_file);
log.info("Thread {} Creating employments path here {}", id, employmentsPath);
final String worksPath = String.format("%s/works_%s", baseOutputPath, id);
final SequenceFile.Writer works_file = SequenceFile
.createWriter(
fileSystem.getConf(),
SequenceFile.Writer.file(new Path(worksPath)),
SequenceFile.Writer.keyClass(Text.class),
SequenceFile.Writer.valueClass(Text.class));
res.put("works", works_file);
log.info("Thread {} Creating works path here {}", id, worksPath);
return res;
}
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
@Override
public void run() {
CompressionCodecFactory factory = new CompressionCodecFactory(fileSystem.getConf());
CompressionCodec codec = factory.getCodec(sourcePath);
if (codec == null) {
System.err.println("No codec found for " + sourcePath.getName());
System.exit(1);
}
InputStream gzipInputStream = null;
try {
gzipInputStream = codec.createInputStream(fileSystem.open(sourcePath));
final Map<String, SequenceFile.Writer> fileMap = createMap();
iterateTar(fileMap, gzipInputStream);
} catch (IOException e) {
throw new RuntimeException(e);
} finally {
log.info("Closing gzip stream");
IOUtils.closeStream(gzipInputStream);
}
}
private SequenceFile.Writer retrieveFile(Map<String, SequenceFile.Writer> fileMap, final String path) {
if (sourcePath.getName().contains("summaries")) {
return fileMap.get("summary");
}
if (path.contains("works")) {
return fileMap.get("works");
}
if (path.contains("employments"))
return fileMap.get("employments");
return null;
}
private void iterateTar(Map<String, SequenceFile.Writer> fileMap, InputStream gzipInputStream) throws IOException {
int extractedItem = 0;
try (final TarArchiveInputStream tais = new TarArchiveInputStream(gzipInputStream)) {
TarArchiveEntry entry;
while ((entry = tais.getNextTarEntry()) != null) {
if (entry.isFile()) {
final SequenceFile.Writer fl = retrieveFile(fileMap, entry.getName());
if (fl != null) {
final Text key = new Text(entry.getName());
final Text value = new Text(
org.apache.commons.io.IOUtils.toString(new BufferedReader(new InputStreamReader(tais))));
fl.append(key, value);
extractedItem++;
if (extractedItem % 100000 == 0) {
log.info("Thread {}: Extracted {} items", id, extractedItem);
break;
}
}
}
}
} finally {
for (SequenceFile.Writer k : fileMap.values()) {
log.info("Thread {}: Completed processed {} items", id, extractedItem);
k.hflush();
k.close();
}
}
}
}

View File

@ -1,251 +0,0 @@
package eu.dnetlib.dhp.collection.orcid;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.ximpleware.*;
import eu.dnetlib.dhp.collection.orcid.model.*;
import eu.dnetlib.dhp.parser.utility.VtdException;
import eu.dnetlib.dhp.parser.utility.VtdUtilityParser;
public class OrcidParser {
final Logger log = LoggerFactory.getLogger(OrcidParser.class);
private VTDNav vn;
private AutoPilot ap;
private static final String NS_COMMON_URL = "http://www.orcid.org/ns/common";
private static final String NS_COMMON = "common";
private static final String NS_PERSON_URL = "http://www.orcid.org/ns/person";
private static final String NS_PERSON = "person";
private static final String NS_DETAILS_URL = "http://www.orcid.org/ns/personal-details";
private static final String NS_DETAILS = "personal-details";
private static final String NS_OTHER_URL = "http://www.orcid.org/ns/other-name";
private static final String NS_OTHER = "other-name";
private static final String NS_RECORD_URL = "http://www.orcid.org/ns/record";
private static final String NS_RECORD = "record";
private static final String NS_ERROR_URL = "http://www.orcid.org/ns/error";
private static final String NS_ACTIVITIES = "activities";
private static final String NS_ACTIVITIES_URL = "http://www.orcid.org/ns/activities";
private static final String NS_WORK = "work";
private static final String NS_WORK_URL = "http://www.orcid.org/ns/work";
private static final String NS_ERROR = "error";
private static final String NS_HISTORY = "history";
private static final String NS_HISTORY_URL = "http://www.orcid.org/ns/history";
private static final String NS_BULK_URL = "http://www.orcid.org/ns/bulk";
private static final String NS_BULK = "bulk";
private static final String NS_EXTERNAL = "external-identifier";
private static final String NS_EXTERNAL_URL = "http://www.orcid.org/ns/external-identifier";
private void generateParsedDocument(final String xml) throws ParseException {
final VTDGen vg = new VTDGen();
vg.setDoc(xml.getBytes());
vg.parse(true);
this.vn = vg.getNav();
this.ap = new AutoPilot(vn);
ap.declareXPathNameSpace(NS_COMMON, NS_COMMON_URL);
ap.declareXPathNameSpace(NS_PERSON, NS_PERSON_URL);
ap.declareXPathNameSpace(NS_DETAILS, NS_DETAILS_URL);
ap.declareXPathNameSpace(NS_OTHER, NS_OTHER_URL);
ap.declareXPathNameSpace(NS_RECORD, NS_RECORD_URL);
ap.declareXPathNameSpace(NS_ERROR, NS_ERROR_URL);
ap.declareXPathNameSpace(NS_HISTORY, NS_HISTORY_URL);
ap.declareXPathNameSpace(NS_WORK, NS_WORK_URL);
ap.declareXPathNameSpace(NS_EXTERNAL, NS_EXTERNAL_URL);
ap.declareXPathNameSpace(NS_ACTIVITIES, NS_ACTIVITIES_URL);
}
public Author parseSummary(final String xml) {
try {
final Author author = new Author();
generateParsedDocument(xml);
List<VtdUtilityParser.Node> recordNodes = VtdUtilityParser
.getTextValuesWithAttributes(
ap, vn, "//record:record", Arrays.asList("path"));
if (!recordNodes.isEmpty()) {
final String oid = (recordNodes.get(0).getAttributes().get("path")).substring(1);
author.setOrcid(oid);
} else {
return null;
}
List<VtdUtilityParser.Node> personNodes = VtdUtilityParser
.getTextValuesWithAttributes(
ap, vn, "//person:name", Arrays.asList("visibility"));
final String visibility = (personNodes.get(0).getAttributes().get("visibility"));
author.setVisibility(visibility);
final String name = VtdUtilityParser.getSingleValue(ap, vn, "//personal-details:given-names");
author.setGivenName(name);
final String surnames = VtdUtilityParser.getSingleValue(ap, vn, "//personal-details:family-name");
author.setFamilyName(surnames);
final String creditNames = VtdUtilityParser.getSingleValue(ap, vn, "//personal-details:credit-name");
author.setCreditName(creditNames);
final String biography = VtdUtilityParser
.getSingleValue(ap, vn, "//person:biography/personal-details:content");
author.setBiography(biography);
final List<String> otherNames = VtdUtilityParser.getTextValue(ap, vn, "//other-name:content");
if (!otherNames.isEmpty()) {
author.setOtherNames(otherNames);
}
ap.selectXPath("//external-identifier:external-identifier");
while (ap.evalXPath() != -1) {
final Pid pid = new Pid();
final AutoPilot ap1 = new AutoPilot(ap.getNav());
ap1.selectXPath("./common:external-id-type");
while (ap1.evalXPath() != -1) {
int it = vn.getText();
pid.setSchema(vn.toNormalizedString(it));
}
ap1.selectXPath("./common:external-id-value");
while (ap1.evalXPath() != -1) {
int it = vn.getText();
pid.setValue(vn.toNormalizedString(it));
}
author.addOtherPid(pid);
}
return author;
} catch (Throwable e) {
log.error("Error on parsing {}", xml);
log.error(e.getMessage());
return null;
}
}
public Work parseWork(final String xml) {
try {
final Work work = new Work();
generateParsedDocument(xml);
List<VtdUtilityParser.Node> workNodes = VtdUtilityParser
.getTextValuesWithAttributes(ap, vn, "//work:work", Arrays.asList("path", "visibility"));
if (!workNodes.isEmpty()) {
final String oid = (workNodes.get(0).getAttributes().get("path")).split("/")[1];
work.setOrcid(oid);
} else {
return null;
}
ap.selectXPath("//common:external-id");
while (ap.evalXPath() != -1) {
final Pid pid = new Pid();
final AutoPilot ap1 = new AutoPilot(ap.getNav());
ap1.selectXPath("./common:external-id-type");
while (ap1.evalXPath() != -1) {
int it = vn.getText();
pid.setSchema(vn.toNormalizedString(it));
}
ap1.selectXPath("./common:external-id-value");
while (ap1.evalXPath() != -1) {
int it = vn.getText();
pid.setValue(vn.toNormalizedString(it));
}
work.addPid(pid);
}
work.setTitle(VtdUtilityParser.getSingleValue(ap, vn, "//work:title/common:title"));
return work;
} catch (Throwable e) {
log.error("Error on parsing {}", xml);
log.error(e.getMessage());
return null;
}
}
private String extractEmploymentDate(final String xpath) throws Exception {
ap.selectXPath(xpath);
StringBuilder sb = new StringBuilder();
while (ap.evalXPath() != -1) {
final AutoPilot ap1 = new AutoPilot(ap.getNav());
ap1.selectXPath("./common:year");
while (ap1.evalXPath() != -1) {
int it = vn.getText();
sb.append(vn.toNormalizedString(it));
}
ap1.selectXPath("./common:month");
while (ap1.evalXPath() != -1) {
int it = vn.getText();
sb.append("-");
sb.append(vn.toNormalizedString(it));
}
ap1.selectXPath("./common:day");
while (ap1.evalXPath() != -1) {
int it = vn.getText();
sb.append("-");
sb.append(vn.toNormalizedString(it));
}
}
return sb.toString();
}
public Employment parseEmployment(final String xml) {
try {
final Employment employment = new Employment();
generateParsedDocument(xml);
final String oid = VtdUtilityParser
.getSingleValue(ap, vn, "//common:source-orcid/common:path");
if (StringUtils.isNotBlank(oid)) {
employment.setOrcid(oid);
} else {
return null;
}
final String depName = VtdUtilityParser
.getSingleValue(ap, vn, "//common:department-name");
final String rolTitle = VtdUtilityParser
.getSingleValue(ap, vn, "//common:role-title");
if (StringUtils.isNotBlank(rolTitle))
employment.setRoleTitle(rolTitle);
if (StringUtils.isNotBlank(depName))
employment.setDepartmentName(depName);
else
employment
.setDepartmentName(
VtdUtilityParser
.getSingleValue(ap, vn, "//common:organization/common:name"));
employment.setStartDate(extractEmploymentDate("//common:start-date"));
employment.setEndDate(extractEmploymentDate("//common:end-date"));
final String affiliationId = VtdUtilityParser
.getSingleValue(ap, vn, "//common:disambiguated-organization-identifier");
final String affiliationIdType = VtdUtilityParser
.getSingleValue(ap, vn, "//common:disambiguation-source");
if (StringUtils.isNotBlank(affiliationId) || StringUtils.isNotBlank(affiliationIdType))
employment.setAffiliationId(new Pid(affiliationId, affiliationIdType));
return employment;
} catch (Throwable e) {
log.error("Error on parsing {}", xml);
log.error(e.getMessage());
return null;
}
}
}
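For orientation, a minimal usage sketch of the parse methods above. It is a hedged example, not code from this repository: the enclosing class name OrcidParser, its no-argument constructor and the three input strings are assumptions made for illustration.

// Hedged usage sketch: OrcidParser, its constructor and the XML variables below are assumed for illustration.
final OrcidParser parser = new OrcidParser();
final String summaryXml = "...";     // an ORCID person/summary record (placeholder)
final String workXml = "...";        // an ORCID work record (placeholder)
final String employmentXml = "...";  // an ORCID employment record (placeholder)

final Author author = parser.parseAuthor(summaryXml);
final Work work = parser.parseWork(workXml);
final Employment employment = parser.parseEmployment(employmentXml);

// Each parse method returns null when the record cannot be parsed (see the catch blocks above).
if (author != null) {
    System.out.println(author.getGivenName() + " " + author.getFamilyName());
}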


@ -1,83 +0,0 @@
package eu.dnetlib.dhp.collection.orcid.model;
import java.util.ArrayList;
import java.util.List;
public class Author extends ORCIDItem {
private String givenName;
private String familyName;
private String visibility;
private String creditName;
private List<String> otherNames;
private List<Pid> otherPids;
private String biography;
public String getBiography() {
return biography;
}
public void setBiography(String biography) {
this.biography = biography;
}
public String getGivenName() {
return givenName;
}
public void setGivenName(String givenName) {
this.givenName = givenName;
}
public String getFamilyName() {
return familyName;
}
public void setFamilyName(String familyName) {
this.familyName = familyName;
}
public String getCreditName() {
return creditName;
}
public void setCreditName(String creditName) {
this.creditName = creditName;
}
public List<String> getOtherNames() {
return otherNames;
}
public void setOtherNames(List<String> otherNames) {
this.otherNames = otherNames;
}
public String getVisibility() {
return visibility;
}
public void setVisibility(String visibility) {
this.visibility = visibility;
}
public List<Pid> getOtherPids() {
return otherPids;
}
public void setOtherPids(List<Pid> otherPids) {
this.otherPids = otherPids;
}
public void addOtherPid(final Pid pid) {
if (otherPids == null)
otherPids = new ArrayList<>();
otherPids.add(pid);
}
}


@ -1,54 +0,0 @@
package eu.dnetlib.dhp.collection.orcid.model;
public class Employment extends ORCIDItem {
private String startDate;
private String endDate;
private Pid affiliationId;
private String departmentName;
private String roleTitle;
public String getStartDate() {
return startDate;
}
public void setStartDate(String startDate) {
this.startDate = startDate;
}
public String getEndDate() {
return endDate;
}
public void setEndDate(String endDate) {
this.endDate = endDate;
}
public Pid getAffiliationId() {
return affiliationId;
}
public void setAffiliationId(Pid affiliationId) {
this.affiliationId = affiliationId;
}
public String getDepartmentName() {
return departmentName;
}
public void setDepartmentName(String departmentName) {
this.departmentName = departmentName;
}
public String getRoleTitle() {
return roleTitle;
}
public void setRoleTitle(String roleTitle) {
this.roleTitle = roleTitle;
}
}


@ -1,14 +0,0 @@
package eu.dnetlib.dhp.collection.orcid.model;
public class ORCIDItem {
private String orcid;
public String getOrcid() {
return orcid;
}
public void setOrcid(String orcid) {
this.orcid = orcid;
}
}


@ -1,33 +0,0 @@
package eu.dnetlib.dhp.collection.orcid.model;
public class Pid {
private String value;
private String schema;
public Pid() {
}
public Pid(String value, String schema) {
this.value = value;
this.schema = schema;
}
public String getValue() {
return value;
}
public void setValue(String value) {
this.value = value;
}
public String getSchema() {
return schema;
}
public void setSchema(String schema) {
this.schema = schema;
}
}


@ -1,35 +0,0 @@
package eu.dnetlib.dhp.collection.orcid.model;
import java.util.ArrayList;
import java.util.List;
public class Work extends ORCIDItem {
private String title;
private List<Pid> pids;
public String getTitle() {
return title;
}
public void setTitle(String title) {
this.title = title;
}
public List<Pid> getPids() {
return pids;
}
public void setPids(List<Pid> pids) {
this.pids = pids;
}
public void addPid(Pid pid) {
if (pids == null)
pids = new ArrayList<>();
pids.add(pid);
}
}


@ -10,8 +10,7 @@ import eu.dnetlib.dhp.common.collection.CollectorException;
public interface CollectorPlugin {
enum NAME {
oai, other, rest_json2xml, file, fileGzip, baseDump;
oai, other, rest_json2xml, file, fileGzip;
public enum OTHER_NAME {
mdstore_mongodb_dump, mdstore_mongodb


@ -1,171 +0,0 @@
package eu.dnetlib.dhp.collection.plugin.base;
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.util.Iterator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.CompressorInputStream;
import org.apache.commons.compress.compressors.CompressorStreamFactory;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.common.aggregation.AggregatorReport;
public class BaseCollectorIterator implements Iterator<String> {
private String nextElement;
private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
private static final Logger log = LoggerFactory.getLogger(BaseCollectorIterator.class);
private static final String END_ELEM = "__END__";
public BaseCollectorIterator(final FileSystem fs, final Path filePath, final AggregatorReport report) {
new Thread(() -> importHadoopFile(fs, filePath, report)).start();
try {
this.nextElement = this.queue.take();
} catch (final InterruptedException e) {
throw new RuntimeException(e);
}
}
protected BaseCollectorIterator(final String resourcePath, final AggregatorReport report) {
new Thread(() -> importTestFile(resourcePath, report)).start();
try {
this.nextElement = this.queue.take();
} catch (final InterruptedException e) {
throw new RuntimeException(e);
}
}
@Override
public synchronized boolean hasNext() {
return (this.nextElement != null) && !END_ELEM.equals(this.nextElement);
}
@Override
public synchronized String next() {
try {
return END_ELEM.equals(this.nextElement) ? null : this.nextElement;
} finally {
try {
this.nextElement = this.queue.take();
} catch (final InterruptedException e) {
throw new RuntimeException(e);
}
}
}
private void importHadoopFile(final FileSystem fs, final Path filePath, final AggregatorReport report) {
log.info("I start to read the TAR stream");
try (InputStream origInputStream = fs.open(filePath);
final TarArchiveInputStream tarInputStream = new TarArchiveInputStream(origInputStream)) {
importTarStream(tarInputStream, report);
} catch (final Throwable e) {
throw new RuntimeException("Error processing BASE records", e);
}
}
private void importTestFile(final String resourcePath, final AggregatorReport report) {
try (final InputStream origInputStream = BaseCollectorIterator.class.getResourceAsStream(resourcePath);
final TarArchiveInputStream tarInputStream = new TarArchiveInputStream(origInputStream)) {
importTarStream(tarInputStream, report);
} catch (final Throwable e) {
throw new RuntimeException("Error processing BASE records", e);
}
}
private void importTarStream(final TarArchiveInputStream tarInputStream, final AggregatorReport report) {
long count = 0;
final XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
final XMLOutputFactory xmlOutputFactory = XMLOutputFactory.newInstance();
try {
TarArchiveEntry entry;
while ((entry = (TarArchiveEntry) tarInputStream.getNextEntry()) != null) {
final String name = entry.getName();
if (!entry.isDirectory() && name.contains("ListRecords") && name.endsWith(".bz2")) {
log.info("Processing file (BZIP): " + name);
final byte[] bzipData = new byte[(int) entry.getSize()];
IOUtils.readFully(tarInputStream, bzipData);
try (InputStream bzipIs = new ByteArrayInputStream(bzipData);
final BufferedInputStream bzipBis = new BufferedInputStream(bzipIs);
final CompressorInputStream bzipInput = new CompressorStreamFactory()
.createCompressorInputStream(bzipBis)) {
final XMLEventReader reader = xmlInputFactory.createXMLEventReader(bzipInput);
XMLEventWriter eventWriter = null;
StringWriter xmlWriter = null;
while (reader.hasNext()) {
final XMLEvent nextEvent = reader.nextEvent();
if (nextEvent.isStartElement()) {
final StartElement startElement = nextEvent.asStartElement();
if ("record".equals(startElement.getName().getLocalPart())) {
xmlWriter = new StringWriter();
eventWriter = xmlOutputFactory.createXMLEventWriter(xmlWriter);
}
}
if (eventWriter != null) {
eventWriter.add(nextEvent);
}
if (nextEvent.isEndElement()) {
final EndElement endElement = nextEvent.asEndElement();
if ("record".equals(endElement.getName().getLocalPart())) {
eventWriter.flush();
eventWriter.close();
this.queue.put(xmlWriter.toString());
eventWriter = null;
xmlWriter = null;
count++;
}
}
}
}
}
}
this.queue.put(END_ELEM); // TO INDICATE THE END OF THE QUEUE
} catch (final Throwable e) {
log.error("Error processing BASE records", e);
report.put(e.getClass().getName(), e.getMessage());
throw new RuntimeException("Error processing BASE records", e);
} finally {
log.info("Total records (written in queue): " + count);
}
}
}
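A minimal sketch of how this iterator can be consumed. It is a hedged example, not code from this repository: it assumes a configured Hadoop FileSystem, that AggregatorReport offers a no-argument constructor, and an illustrative HDFS path; imports mirror those at the top of the file above, plus org.apache.hadoop.conf.Configuration.

// Hedged sketch: the path and the surrounding setup are illustrative, not taken from this repository.
final FileSystem fs = FileSystem.get(new Configuration());
final AggregatorReport report = new AggregatorReport(); // assumed no-arg constructor
final Iterator<String> records = new BaseCollectorIterator(fs, new Path("/data/base/base_oaipmh_dump-current.tar"), report);
while (records.hasNext()) {
    final String recordXml = records.next(); // one serialized <record> element extracted from the dump
    // process recordXml ...
}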


@ -1,159 +0,0 @@
package eu.dnetlib.dhp.collection.plugin.base;
import java.io.IOException;
import java.sql.SQLException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Optional;
import java.util.Set;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Node;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.collection.ApiDescriptor;
import eu.dnetlib.dhp.collection.plugin.CollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.file.AbstractSplittedRecordPlugin;
import eu.dnetlib.dhp.common.DbClient;
import eu.dnetlib.dhp.common.aggregation.AggregatorReport;
import eu.dnetlib.dhp.common.collection.CollectorException;
public class BaseCollectorPlugin implements CollectorPlugin {
private final FileSystem fs;
private static final Logger log = LoggerFactory.getLogger(BaseCollectorPlugin.class);
// MAPPING AND FILTERING ARE DEFINED HERE:
// https://docs.google.com/document/d/1Aj-ZAV11b44MCrAAUCPiS2TUlXb6PnJEu1utCMAcCOU/edit
public BaseCollectorPlugin(final FileSystem fs) {
this.fs = fs;
}
@Override
public Stream<String> collect(final ApiDescriptor api, final AggregatorReport report) throws CollectorException {
// the path of the dump file on HDFS
// http://oai.base-search.net/initial_load/base_oaipmh_dump-current.tar
// it can be downloaded from iis-cdh5-test-gw.ocean.icm.edu.pl and then copied to HDFS
final Path filePath = Optional
.ofNullable(api.getBaseUrl())
.map(Path::new)
.orElseThrow(() -> new CollectorException("missing baseUrl"));
// get the parameters for the connection to the OpenAIRE database.
// the database is used to obtain the list of the datasources that the plugin will collect
final String dbUrl = api.getParams().get("dbUrl");
final String dbUser = api.getParams().get("dbUser");
final String dbPassword = api.getParams().get("dbPassword");
// the types (comma-separated; an empty value accepts all) that the plugin will collect;
// the types should be expressed as BASE normalized type codes (for example 1,121,...)
final String acceptedNormTypesString = api.getParams().get("acceptedNormTypes");
log.info("baseUrl: {}", filePath);
log.info("dbUrl: {}", dbUrl);
log.info("dbUser: {}", dbUser);
log.info("dbPassword: {}", "***");
log.info("acceptedNormTypes: {}", acceptedNormTypesString);
try {
if (!this.fs.exists(filePath)) {
throw new CollectorException("path does not exist: " + filePath);
}
} catch (final Throwable e) {
throw new CollectorException(e);
}
final Set<String> acceptedOpendoarIds = findAcceptedOpendoarIds(dbUrl, dbUser, dbPassword);
final Set<String> acceptedNormTypes = new HashSet<>();
if (StringUtils.isNotBlank(acceptedNormTypesString)) {
for (final String s : StringUtils.split(acceptedNormTypesString, ",")) {
if (StringUtils.isNotBlank(s)) {
acceptedNormTypes.add(s.trim());
}
}
}
final Iterator<String> iterator = new BaseCollectorIterator(this.fs, filePath, report);
final Spliterator<String> spliterator = Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);
return StreamSupport
.stream(spliterator, false)
.filter(doc -> filterXml(doc, acceptedOpendoarIds, acceptedNormTypes));
}
private Set<String> findAcceptedOpendoarIds(final String dbUrl, final String dbUser, final String dbPassword)
throws CollectorException {
final Set<String> accepted = new HashSet<>();
try (final DbClient dbClient = new DbClient(dbUrl, dbUser, dbPassword)) {
final String sql = IOUtils
.toString(
getClass().getResourceAsStream("/eu/dnetlib/dhp/collection/plugin/base/sql/opendoar-accepted.sql"));
dbClient.processResults(sql, row -> {
try {
final String dsId = row.getString("id");
log.info("Accepted Datasource: " + dsId);
accepted.add(dsId);
} catch (final SQLException e) {
log.error("Error in SQL", e);
throw new RuntimeException("Error in SQL", e);
}
});
} catch (final IOException e) {
log.error("Error accessong SQL", e);
throw new CollectorException("Error accessong SQL", e);
}
log.info("Accepted Datasources (TOTAL): " + accepted.size());
return accepted;
}
protected static boolean filterXml(final String xml,
final Set<String> acceptedOpendoarIds,
final Set<String> acceptedNormTypes) {
try {
final Document doc = DocumentHelper.parseText(xml);
final String id = doc.valueOf("//*[local-name()='collection']/@opendoar_id").trim();
if (StringUtils.isBlank(id) || !acceptedOpendoarIds.contains("opendoar____::" + id)) {
return false;
}
if (acceptedNormTypes.isEmpty()) {
return true;
}
for (final Object s : doc.selectNodes("//*[local-name()='typenorm']")) {
if (acceptedNormTypes.contains(((Node) s).getText().trim())) {
return true;
}
}
return false;
} catch (final DocumentException e) {
log.error("Error parsing document", e);
throw new RuntimeException("Error parsing document", e);
}
}
}
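To make the filtering rule above concrete, a hedged sketch (callable from a test in the same package, since filterXml is protected; the record snippet and identifiers are invented for the example, and the java.util imports are assumed).

// Illustrative only: a record passes when its collection carries an accepted OpenDOAR id and,
// if a type filter is configured, at least one accepted normalized type.
final Set<String> acceptedIds = Collections.singleton("opendoar____::1234");
final Set<String> acceptedTypes = new HashSet<>(Arrays.asList("1", "121"));
final String record = "<record><collection opendoar_id=\"1234\"/><typenorm>121</typenorm></record>";
final boolean kept = BaseCollectorPlugin.filterXml(record, acceptedIds, acceptedTypes); // true for this sample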


@ -1,21 +0,0 @@
[
{
"paramName": "n",
"paramLongName": "namenode",
"paramDescription": "the Name Node URI",
"paramRequired": true
},
{
"paramName": "t",
"paramLongName": "targetPath",
"paramDescription": "the target PATH where download the files",
"paramRequired": true
},
{
"paramName": "a",
"paramLongName": "apiURL",
"paramDescription": "the FIGSHARE API id URL to retrieve all the dump files",
"paramRequired": true
}
]


@ -1,21 +0,0 @@
[
{
"paramName": "n",
"paramLongName": "namenode",
"paramDescription": "the Name Node URI",
"paramRequired": true
},
{
"paramName": "t",
"paramLongName": "targetPath",
"paramDescription": "the target PATH to extract files",
"paramRequired": true
},
{
"paramName": "s",
"paramLongName": "sourcePath",
"paramDescription": "the PATH where the tar.gz files were downloaded",
"paramRequired": true
}
]


@ -1,21 +0,0 @@
[
{
"paramName": "m",
"paramLongName": "master",
"paramDescription": "the master name",
"paramRequired": true
},
{
"paramName": "t",
"paramLongName": "targetPath",
"paramDescription": "the target PATH of the DF tables",
"paramRequired": true
},
{
"paramName": "s",
"paramLongName": "sourcePath",
"paramDescription": "the PATH of the ORCID sequence file",
"paramRequired": true
}
]


@ -1,23 +0,0 @@
<configuration>
<property>
<name>jobTracker</name>
<value>yarnRM</value>
</property>
<property>
<name>nameNode</name>
<value>hdfs://nameservice1</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
<property>
<name>oozie.action.sharelib.for.spark</name>
<value>spark2</value>
</property>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>


@ -1,81 +0,0 @@
<workflow-app name="download_ORCID_DUMP" xmlns="uri:oozie:workflow:0.5">
<parameters>
<property>
<name>targetPath</name>
<description>the path to store the original ORCID dump</description>
</property>
<property>
<name>apiURL</name>
<description>The figshare API URL used to retrieve the list of files to download</description>
</property>
</parameters>
<start to="generateTables"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="DownloadDUMP">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.collection.orcid.DownloadORCIDDumpApplication</main-class>
<arg>--namenode</arg><arg>${nameNode}</arg>
<arg>--targetPath</arg><arg>${targetPath}</arg>
<arg>--apiURL</arg><arg>${apiURL}</arg>
</java>
<ok to="extractDump"/>
<error to="Kill"/>
</action>
<action name="extractDump">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.collection.orcid.ExtractORCIDDump</main-class>
<java-opts> -Xmx6g </java-opts>
<arg>--namenode</arg><arg>${nameNode}</arg>
<arg>--sourcePath</arg><arg>${targetPath}</arg>
<arg>--targetPath</arg><arg>${targetPath}/extracted</arg>
</java>
<ok to="generateTables"/>
<error to="Kill"/>
</action>
<action name="generateTables">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Generate ORCID Tables</name>
<class>eu.dnetlib.dhp.collection.orcid.SparkGenerateORCIDTable</class>
<jar>dhp-aggregation-${projectVersion}.jar</jar>
<spark-opts>
--executor-memory=${sparkExecutorMemory}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.executor.memoryOverhead=2g
--conf spark.sql.shuffle.partitions=3000
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
</spark-opts>
<arg>--sourcePath</arg><arg>${targetPath}/extracted</arg>
<arg>--targetPath</arg><arg>${targetPath}/tables</arg>
<arg>--master</arg><arg>yarn</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>


@ -1,21 +0,0 @@
[
{
"paramName": "n",
"paramLongName": "namenode",
"paramDescription": "the Name Node URI",
"paramRequired": true
},
{
"paramName": "t",
"paramLongName": "targetPath",
"paramDescription": "the target PATH where download the files",
"paramRequired": true
},
{
"paramName": "a",
"paramLongName": "apiURL",
"paramDescription": "the FIGSHARE API id URL to retrieve all the dump files",
"paramRequired": true
}
]


@ -1,114 +0,0 @@
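-- Registers the BASE dump as a datasource in the OpenAIRE database: the service record, its link to the
-- owning organization, the 'baseDump' API pointing at the dump tar file on HDFS, and the API parameters
-- (database connection and accepted normalized types) read by the collector plugin.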
BEGIN;
INSERT INTO dsm_services(
_dnet_resource_identifier_,
id,
officialname,
englishname,
namespaceprefix,
websiteurl,
logourl,
platform,
contactemail,
collectedfrom,
provenanceaction,
_typology_to_remove_,
eosc_type,
eosc_datasource_type,
research_entity_types,
thematic
) VALUES (
'openaire____::base_search',
'openaire____::base_search',
'Bielefeld Academic Search Engine (BASE)',
'Bielefeld Academic Search Engine (BASE)',
'base_search_',
'https://www.base-search.net',
'https://www.base-search.net/about/download/logo_224x57_white.gif',
'BASE',
'openaire-helpdesk@uni-bielefeld.de',
'infrastruct_::openaire',
'user:insert',
'aggregator::pubsrepository::unknown',
'Data Source',
'Aggregator',
ARRAY['Research Products'],
false
);
INSERT INTO dsm_service_organization(
_dnet_resource_identifier_,
organization,
service
) VALUES (
'fairsharing_::org::214@@openaire____::base_search',
'fairsharing_::org::214',
'openaire____::base_search'
);
INSERT INTO dsm_api(
_dnet_resource_identifier_,
id,
service,
protocol,
baseurl,
metadata_identifier_path
) VALUES (
'api_________::openaire____::base_search::dump',
'api_________::openaire____::base_search::dump',
'openaire____::base_search',
'baseDump',
'/user/michele.artini/base-import/base_oaipmh_dump-current.tar',
'//*[local-name()=''header'']/*[local-name()=''identifier'']'
);
INSERT INTO dsm_apiparams(
_dnet_resource_identifier_,
api,
param,
value
) VALUES (
'api_________::openaire____::base_search::dump@@dbUrl',
'api_________::openaire____::base_search::dump',
'dbUrl',
'jdbc:postgresql://postgresql.services.openaire.eu:5432/dnet_openaireplus'
);
INSERT INTO dsm_apiparams(
_dnet_resource_identifier_,
api,
param,
value
) VALUES (
'api_________::openaire____::base_search::dump@@dbUser',
'api_________::openaire____::base_search::dump',
'dbUser',
'dnet'
);
INSERT INTO dsm_apiparams(
_dnet_resource_identifier_,
api,
param,
value
) VALUES (
'api_________::openaire____::base_search::dump@@dbPassword',
'api_________::openaire____::base_search::dump',
'dbPassword',
'***'
);
INSERT INTO dsm_apiparams(
_dnet_resource_identifier_,
api,
param,
value
) VALUES (
'api_________::openaire____::base_search::dump@@acceptedNormTypes',
'api_________::openaire____::base_search::dump',
'acceptedNormTypes',
'1,11,111,121,13,14,15,18,181,182,183,1A,6,7'
);
COMMIT;


@ -1,9 +0,0 @@
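-- Presumably the opendoar-accepted.sql query loaded by BaseCollectorPlugin.findAcceptedOpendoarIds above:
-- it selects institutional OpenDOAR datasources having at least one API whose compatibility is 'driver'
-- or 'UNKNOWN' and no API with an OpenAIRE-compatible level.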
select s.id as id
from dsm_services s
where collectedfrom = 'openaire____::opendoar'
and jurisdiction = 'Institutional'
and s.id in (
select service from dsm_api where coalesce(compatibility_override, compatibility) = 'driver' or coalesce(compatibility_override, compatibility) = 'UNKNOWN'
) and s.id not in (
select service from dsm_api where coalesce(compatibility_override, compatibility) like '%openaire%'
);


@ -1,11 +0,0 @@
select
s.id as id,
s.jurisdiction as jurisdiction,
array_remove(array_agg(a.id || ' (compliance: ' || coalesce(a.compatibility_override, a.compatibility, 'UNKNOWN') || ')@@@' || coalesce(a.last_collection_total, 0)), NULL) as aggregations
from
dsm_services s
join dsm_api a on (s.id = a.service)
where
collectedfrom = 'openaire____::opendoar'
group by
s.id;


@ -1,180 +0,0 @@
<RESOURCE_PROFILE>
<HEADER>
<RESOURCE_IDENTIFIER value="c67911d6-9988-4a3b-b965-7d39bdd4a31d_Vm9jYWJ1bGFyeURTUmVzb3VyY2VzL1ZvY2FidWxhcnlEU1Jlc291cmNlVHlwZQ==" />
<RESOURCE_TYPE value="VocabularyDSResourceType" />
<RESOURCE_KIND value="VocabularyDSResources" />
<RESOURCE_URI value="" />
<DATE_OF_CREATION value="2024-02-13T11:15:48+00:00" />
</HEADER>
<BODY>
<CONFIGURATION>
<VOCABULARY_NAME code="base:normalized_types">base:normalized_types</VOCABULARY_NAME>
<VOCABULARY_DESCRIPTION>base:normalized_types</VOCABULARY_DESCRIPTION>
<TERMS>
<TERM native_name="Text" code="Text" english_name="Text" encoding="BASE">
<SYNONYMS>
<SYNONYM term="1" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Book" code="Book" english_name="Book" encoding="BASE">
<SYNONYMS>
<SYNONYM term="11" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Book part" code="Book part" english_name="Book part" encoding="BASE">
<SYNONYMS>
<SYNONYM term="111" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Journal/Newspaper" code="Journal/Newspaper" english_name="Journal/Newspaper" encoding="BASE">
<SYNONYMS>
<SYNONYM term="12" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Article contribution" code="Article contribution" english_name="Article contribution" encoding="BASE">
<SYNONYMS>
<SYNONYM term="121" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Other non-article" code="Other non-article" english_name="Other non-article" encoding="BASE">
<SYNONYMS>
<SYNONYM term="122" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Conference object" code="Conference object" english_name="Conference object" encoding="BASE">
<SYNONYMS>
<SYNONYM term="13" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Report" code="Report" english_name="Report" encoding="BASE">
<SYNONYMS>
<SYNONYM term="14" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Review" code="Review" english_name="Review" encoding="BASE">
<SYNONYMS>
<SYNONYM term="15" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Course material" code="Course material" english_name="Course material" encoding="BASE">
<SYNONYMS>
<SYNONYM term="16" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Lecture" code="Lecture" english_name="Lecture" encoding="BASE">
<SYNONYMS>
<SYNONYM term="17" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Thesis" code="Thesis" english_name="Thesis" encoding="BASE">
<SYNONYMS>
<SYNONYM term="18" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Bachelor's thesis" code="Bachelor's thesis" english_name="Bachelor's thesis" encoding="BASE">
<SYNONYMS>
<SYNONYM term="181" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Master's thesis" code="Master's thesis" english_name="Master's thesis" encoding="BASE">
<SYNONYMS>
<SYNONYM term="182" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Doctoral and postdoctoral thesis" code="Doctoral and postdoctoral thesis" english_name="Doctoral and postdoctoral thesis" encoding="BASE">
<SYNONYMS>
<SYNONYM term="183" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Manuscript" code="Manuscript" english_name="Manuscript" encoding="BASE">
<SYNONYMS>
<SYNONYM term="19" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Patent" code="Patent" english_name="Patent" encoding="BASE">
<SYNONYMS>
<SYNONYM term="1A" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Musical notation" code="Musical notation" english_name="Musical notation" encoding="BASE">
<SYNONYMS>
<SYNONYM term="2" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Map" code="Map" english_name="Map" encoding="BASE">
<SYNONYMS>
<SYNONYM term="3" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Audio" code="Audio" english_name="Audio" encoding="BASE">
<SYNONYMS>
<SYNONYM term="4" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Image/Video" code="Image/Video" english_name="Image/Video" encoding="BASE">
<SYNONYMS>
<SYNONYM term="5" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Still image" code="Still image" english_name="Still image" encoding="BASE">
<SYNONYMS>
<SYNONYM term="51" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Moving image/Video" code="Moving image/Video" english_name="Moving image/Video" encoding="BASE">
<SYNONYMS>
<SYNONYM term="52" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Software" code="Software" english_name="Software" encoding="BASE">
<SYNONYMS>
<SYNONYM term="6" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Dataset" code="Dataset" english_name="Dataset" encoding="BASE">
<SYNONYMS>
<SYNONYM term="7" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Unknown" code="Unknown" english_name="Unknown" encoding="BASE">
<SYNONYMS>
<SYNONYM term="F" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
</TERMS>
</CONFIGURATION>
<STATUS>
<LAST_UPDATE value="2013-11-18T10:46:36Z" />
</STATUS>
<SECURITY_PARAMETERS>String</SECURITY_PARAMETERS>
</BODY>
</RESOURCE_PROFILE>


@ -1,298 +0,0 @@
<RESOURCE_PROFILE>
<HEADER>
<RESOURCE_IDENTIFIER value="" />
<RESOURCE_TYPE value="TransformationRuleDSResourceType" />
<RESOURCE_KIND value="TransformationRuleDSResources" />
<RESOURCE_URI value="" />
<DATE_OF_CREATION value="2024-03-05T11:23:00+00:00" />
</HEADER>
<BODY>
<CONFIGURATION>
<SOURCE_METADATA_FORMAT interpretation="cleaned" layout="store" name="dc" />
<SINK_METADATA_FORMAT name="oaf_hbase" />
<IMPORTED />
<SCRIPT>
<TITLE>xslt_base2oaf_hadoop</TITLE>
<CODE>
<xsl:stylesheet xmlns:oaire="http://namespace.openaire.eu/schema/oaire/" xmlns:dateCleaner="http://eu/dnetlib/transform/dateISO"
xmlns:base_dc="http://oai.base-search.net/base_dc/"
xmlns:datacite="http://datacite.org/schema/kernel-4" xmlns:dr="http://www.driver-repository.eu/namespace/dr" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:vocabulary="http://eu/dnetlib/transform/clean" xmlns:oaf="http://namespace.openaire.eu/oaf"
xmlns:oai="http://www.openarchives.org/OAI/2.0/" xmlns:dri="http://www.driver-repository.eu/namespace/dri" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dc="http://purl.org/dc/elements/1.1/"
exclude-result-prefixes="xsl vocabulary dateCleaner base_dc" version="2.0">
<xsl:param name="varOfficialName" />
<xsl:param name="varDataSourceId" />
<xsl:param name="varFP7" select="'corda_______::'" />
<xsl:param name="varH2020" select="'corda__h2020::'" />
<xsl:param name="repoCode" select="substring-before(//*[local-name() = 'header']/*[local-name()='recordIdentifier'], ':')" />
<xsl:param name="index" select="0" />
<xsl:param name="transDate" select="current-dateTime()" />
<xsl:template name="terminate">
<xsl:message terminate="yes">
record is not compliant, transformation is interrupted.
</xsl:message>
</xsl:template>
<xsl:template match="/">
<record>
<xsl:apply-templates select="//*[local-name() = 'header']" />
<!-- TO EVALUATE
base_dc:authod_id
base_dc:authod_id/base_dc:creator_id
base_dc:authod_id/base_dc:creator_name
example:
<dc:creator>ALBU, Svetlana</dc:creator>
<base_dc:authod_id>
<base_dc:creator_name>ALBU, Svetlana</base_dc:creator_name>
<base_dc:creator_id>https://orcid.org/0000-0002-8648-950X</base_dc:creator_id>
</base_dc:authod_id>
-->
<!-- NOT USED
base_dc:global_id (I used oai:identifier)
base_dc:collection/text()
base_dc:continent
base_dc:country
base_dc:year (I used dc:date)
dc:coverage
dc:language (I used base_dc:lang)
base_dc:link (I used dc:identifier)
-->
<xsl:variable name="varBaseNormType" select="vocabulary:clean(//base_dc:typenorm, 'base:normalized_types')" />
<metadata>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:title" />
<xsl:with-param name="targetElement" select="'dc:title'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:creator/replace(., '^(.*)\|.*$', '$1')" />
<xsl:with-param name="targetElement" select="'dc:creator'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:contributor" />
<xsl:with-param name="targetElement" select="'dc:contributor'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:description" />
<xsl:with-param name="targetElement" select="'dc:description'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:subject" />
<xsl:with-param name="targetElement" select="'dc:subject'" />
</xsl:call-template>
<!-- TODO: I'm not sure if this is the correct encoding -->
<xsl:for-each select="//base_dc:classcode|//base_dc:autoclasscode">
<dc:subject><xsl:value-of select="concat(@type, ':', .)" /></dc:subject>
</xsl:for-each>
<!-- END TODO -->
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:publisher" />
<xsl:with-param name="targetElement" select="'dc:publisher'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:format" />
<xsl:with-param name="targetElement" select="'dc:format'" />
</xsl:call-template>
<dc:type>
<xsl:value-of select="$varBaseNormType" />
</dc:type>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:type" />
<xsl:with-param name="targetElement" select="'dc:type'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:source" />
<xsl:with-param name="targetElement" select="'dc:source'" />
</xsl:call-template>
<dc:language>
<xsl:value-of select="vocabulary:clean( //base_dc:lang, 'dnet:languages')" />
</dc:language>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:rights" />
<xsl:with-param name="targetElement" select="'dc:rights'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:relation" />
<xsl:with-param name="targetElement" select="'dc:relation'" />
</xsl:call-template>
<xsl:if test="not(//dc:identifier[starts-with(., 'http')])">
<xsl:call-template name="terminate" />
</xsl:if>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:identifier[starts-with(., 'http')]" />
<xsl:with-param name="targetElement" select="'dc:identifier'" />
</xsl:call-template>
<xsl:for-each select="//dc:relation">
<xsl:if test="matches(normalize-space(.), '(info:eu-repo/grantagreement/ec/fp7/)(\d\d\d\d\d\d)(.*)', 'i')">
<oaf:projectid>
<xsl:value-of select="concat($varFP7, replace(normalize-space(.), '(info:eu-repo/grantagreement/ec/fp7/)(\d\d\d\d\d\d)(.*)', '$2', 'i'))" />
</oaf:projectid>
</xsl:if>
<xsl:if test="matches(normalize-space(.), '(info:eu-repo/grantagreement/ec/h2020/)(\d\d\d\d\d\d)(.*)', 'i')">
<oaf:projectid>
<xsl:value-of select="concat($varH2020, replace(normalize-space(.), '(info:eu-repo/grantagreement/ec/h2020/)(\d\d\d\d\d\d)(.*)', '$2', 'i'))" />
</oaf:projectid>
</xsl:if>
</xsl:for-each>
<dr:CobjCategory>
<xsl:variable name="varCobjCategory" select="vocabulary:clean($varBaseNormType, 'dnet:publication_resource')" />
<xsl:variable name="varSuperType" select="vocabulary:clean($varCobjCategory, 'dnet:result_typologies')" />
<xsl:attribute name="type" select="$varSuperType" />
<xsl:value-of select="$varCobjCategory" />
</dr:CobjCategory>
<oaf:accessrights>
<xsl:choose>
<xsl:when test="//base_dc:oa[.='1']">OPEN</xsl:when>
<xsl:when test="//base_dc:rightsnorm">
<xsl:value-of select="vocabulary:clean(//base_dc:rightsnorm, 'dnet:access_modes')" />
</xsl:when>
<xsl:when test="//dc:rights">
<xsl:value-of select="vocabulary:clean( //dc:rights, 'dnet:access_modes')" />
</xsl:when>
<xsl:otherwise>UNKNOWN</xsl:otherwise>
</xsl:choose>
</oaf:accessrights>
<xsl:for-each select="//base_dc:doi">
<oaf:identifier identifierType="doi">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'http') and (not(contains(., '://dx.doi.org/') or contains(., '://doi.org/') or contains(., '://hdl.handle.net/')))])">
<oaf:identifier identifierType="url">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'http') and contains(., '://hdl.handle.net/')]/substring-after(., 'hdl.handle.net/'))">
<oaf:identifier identifierType="handle">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'urn:nbn:nl:') or starts-with(., 'URN:NBN:NL:')])">
<oaf:identifier identifierType='urn'>
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<oaf:identifier identifierType="oai-original">
<xsl:value-of
select="//*[local-name() = 'about']/*[local-name() = 'provenance']//*[local-name() = 'originDescription' and not(./*[local-name() = 'originDescription'])]/*[local-name() = 'identifier']" />
</oaf:identifier>
<oaf:hostedBy>
<xsl:attribute name="name">
<xsl:value-of select="//base_dc:collname" />
</xsl:attribute>
<xsl:attribute name="id">
<xsl:value-of select="concat('opendoar____::', //base_dc:collection/@opendoar_id)" />
</xsl:attribute>
</oaf:hostedBy>
<oaf:collectedFrom>
<xsl:attribute name="name">
<xsl:value-of select="$varOfficialName" />
</xsl:attribute>
<xsl:attribute name="id">
<xsl:value-of select="$varDataSourceId" />
</xsl:attribute>
</oaf:collectedFrom>
<oaf:dateAccepted>
<xsl:value-of select="dateCleaner:dateISO( //dc:date[1] )" />
</oaf:dateAccepted>
<xsl:if test="//base_dc:oa[.='1']">
<xsl:for-each select="//dc:relation[starts-with(., 'http')]">
<oaf:fulltext>
<xsl:value-of select="normalize-space(.)" />
</oaf:fulltext>
</xsl:for-each>
</xsl:if>
<xsl:for-each select="//base_dc:collection/@ror_id">
<oaf:relation relType="resultOrganization"
subRelType="affiliation"
relClass="hasAuthorInstitution"
targetType="organization">
<xsl:choose>
<xsl:when test="contains(.,'https://ror.org/')">
<xsl:value-of select="concat('ror_________::', normalize-space(.))" />
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="concat('ror_________::https://ror.org/', normalize-space(.))" />
</xsl:otherwise>
</xsl:choose>
</oaf:relation>
</xsl:for-each>
</metadata>
<xsl:copy-of select="//*[local-name() = 'about']" />
</record>
</xsl:template>
<xsl:template name="allElements">
<xsl:param name="sourceElement" />
<xsl:param name="targetElement" />
<xsl:for-each select="$sourceElement">
<xsl:element name="{$targetElement}">
<xsl:value-of select="normalize-space(.)" />
</xsl:element>
</xsl:for-each>
</xsl:template>
<xsl:template match="//*[local-name() = 'header']">
<xsl:if test="//oai:header/@status='deleted'">
<xsl:call-template name="terminate" />
</xsl:if>
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
<xsl:element name="dr:dateOfTransformation">
<xsl:value-of select="$transDate" />
</xsl:element>
</xsl:copy>
</xsl:template>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
</CODE>
</SCRIPT>
</CONFIGURATION>
<STATUS />
<SECURITY_PARAMETERS />
</BODY>
</RESOURCE_PROFILE>


@ -1,322 +0,0 @@
<RESOURCE_PROFILE>
<HEADER>
<RESOURCE_IDENTIFIER value="2ad0cdd9-c96c-484c-8b0e-ed56d86891fe_VHJhbnNmb3JtYXRpb25SdWxlRFNSZXNvdXJjZXMvVHJhbnNmb3JtYXRpb25SdWxlRFNSZXNvdXJjZVR5cGU=" />
<RESOURCE_TYPE value="TransformationRuleDSResourceType" />
<RESOURCE_KIND value="TransformationRuleDSResources" />
<RESOURCE_URI value="" />
<DATE_OF_CREATION value="2024-03-05T11:23:00+00:00" />
</HEADER>
<BODY>
<CONFIGURATION>
<SOURCE_METADATA_FORMAT interpretation="cleaned" layout="store" name="dc" />
<SINK_METADATA_FORMAT name="odf_hbase" />
<IMPORTED />
<SCRIPT>
<TITLE>xslt_base2odf_hadoop</TITLE>
<CODE>
<xsl:stylesheet xmlns:oaire="http://namespace.openaire.eu/schema/oaire/" xmlns:dateCleaner="http://eu/dnetlib/transform/dateISO" xmlns:base_dc="http://oai.base-search.net/base_dc/"
xmlns:datacite="http://datacite.org/schema/kernel-4" xmlns:dr="http://www.driver-repository.eu/namespace/dr" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:vocabulary="http://eu/dnetlib/transform/clean" xmlns:oaf="http://namespace.openaire.eu/oaf"
xmlns:oai="http://www.openarchives.org/OAI/2.0/" xmlns:dri="http://www.driver-repository.eu/namespace/dri" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dc="http://purl.org/dc/elements/1.1/"
exclude-result-prefixes="xsl vocabulary dateCleaner base_dc" version="2.0">
<xsl:param name="varOfficialName" />
<xsl:param name="varDataSourceId" />
<xsl:param name="varFP7" select="'corda_______::'" />
<xsl:param name="varH2020" select="'corda__h2020::'" />
<xsl:param name="repoCode" select="substring-before(//*[local-name() = 'header']/*[local-name()='recordIdentifier'], ':')" />
<xsl:param name="index" select="0" />
<xsl:param name="transDate" select="current-dateTime()" />
<xsl:template name="terminate">
<xsl:message terminate="yes">
record is not compliant, transformation is interrupted.
</xsl:message>
</xsl:template>
<xsl:template match="/">
<record>
<xsl:apply-templates select="//*[local-name() = 'header']" />
<!-- NOT USED
base_dc:global_id (I used oai:identifier)
base_dc:collection/text()
base_dc:continent
base_dc:country
dc:coverage
dc:source
dc:relation
dc:type (I used //base_dc:typenorm)
dc:language (I used base_dc:lang)
base_dc:link (I used dc:identifier)
-->
<xsl:variable name="varBaseNormType" select="vocabulary:clean(//base_dc:typenorm, 'base:normalized_types')" />
<metadata>
<datacite:resource>
<xsl:for-each select="//base_dc:doi">
<datacite:identifier identifierType="DOI">
<xsl:value-of select="." />
</datacite:identifier>
</xsl:for-each>
<datacite:alternateIdentifiers>
<xsl:for-each
select="distinct-values(//dc:identifier[starts-with(., 'http') and (not(contains(., '://dx.doi.org/') or contains(., '://doi.org/') or contains(., '://hdl.handle.net/')))])">
<datacite:identifier alternateIdentifierType="url">
<xsl:value-of select="." />
</datacite:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'http') and contains(., '://hdl.handle.net/')]/substring-after(., 'hdl.handle.net/'))">
<datacite:identifier alternateIdentifierType="handle">
<xsl:value-of select="." />
</datacite:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'urn:nbn:nl:') or starts-with(., 'URN:NBN:NL:')])">
<datacite:identifier alternateIdentifierType='urn'>
<xsl:value-of select="." />
</datacite:identifier>
</xsl:for-each>
<datacite:identifier alternateIdentifierType="oai-original">
<xsl:value-of
select="//*[local-name() = 'about']/*[local-name() = 'provenance']//*[local-name() = 'originDescription' and not(./*[local-name() = 'originDescription'])]/*[local-name() = 'identifier']" />
</datacite:identifier>
</datacite:alternateIdentifiers>
<datacite:relatedIdentifiers />
<datacite:resourceType><xsl:value-of select="$varBaseNormType" /></datacite:resourceType>
<datacite:titles>
<xsl:for-each select="//dc:title">
<datacite:title>
<xsl:value-of select="normalize-space(.)" />
</datacite:title>
</xsl:for-each>
</datacite:titles>
<datacite:creators>
<xsl:for-each select="//dc:creator">
<xsl:variable name="author" select="normalize-space(.)" />
<datacite:creator>
<datacite:creatorName>
<xsl:value-of select="$author" />
</datacite:creatorName>
<xsl:for-each select="//base_dc:authod_id[normalize-space(./base_dc:creator_name) = $author]/base_dc:creator_id ">
<xsl:if test="contains(.,'https://orcid.org/')">
<nameIdentifier schemeURI="https://orcid.org/" nameIdentifierScheme="ORCID">
<xsl:value-of select="substring-after(., 'https://orcid.org/')" />
</nameIdentifier>
</xsl:if>
</xsl:for-each>
</datacite:creator>
</xsl:for-each>
</datacite:creators>
<datacite:contributors>
<xsl:for-each select="//dc:contributor">
<datacite:contributor>
<datacite:contributorName>
<xsl:value-of select="normalize-space(.)" />
</datacite:contributorName>
</datacite:contributor>
</xsl:for-each>
</datacite:contributors>
<datacite:descriptions>
<xsl:for-each select="//dc:description">
<datacite:description descriptionType="Abstract">
<xsl:value-of select="normalize-space(.)" />
</datacite:description>
</xsl:for-each>
</datacite:descriptions>
<datacite:subjects>
<xsl:for-each select="//dc:subject">
<datacite:subject>
<xsl:value-of select="normalize-space(.)" />
</datacite:subject>
</xsl:for-each>
<xsl:for-each select="//base_dc:classcode|//base_dc:autoclasscode">
<datacite:subject subjectScheme="{@type}" classificationCode="{normalize-space(.)}">
<!-- TODO the value should be obtained by the Code -->
<xsl:value-of select="normalize-space(.)" />
</datacite:subject>
</xsl:for-each>
</datacite:subjects>
<datacite:publisher>
<xsl:value-of select="normalize-space(//dc:publisher)" />
</datacite:publisher>
<datacite:publicationYear>
<xsl:value-of select="normalize-space(//base_dc:year)" />
</datacite:publicationYear>
<datacite:formats>
<xsl:for-each select="//dc:format">
<datacite:format>
<xsl:value-of select="normalize-space(.)" />
</datacite:format>
</xsl:for-each>
</datacite:formats>
<datacite:language>
<xsl:value-of select="vocabulary:clean( //base_dc:lang, 'dnet:languages')" />
</datacite:language>
<oaf:accessrights>
<xsl:if test="//base_dc:oa[.='1']">
<datacite:rights rightsURI="http://purl.org/coar/access_right/c_abf2">open access</datacite:rights>
</xsl:if>
<xsl:for-each select="//dc:rights|//base_dc:rightsnorm">
<datacite:rights><xsl:value-of select="vocabulary:clean(., 'dnet:access_modes')" /></datacite:rights>
</xsl:for-each>
</oaf:accessrights>
</datacite:resource>
<xsl:for-each select="//dc:relation">
<xsl:if test="matches(normalize-space(.), '(info:eu-repo/grantagreement/ec/fp7/)(\d\d\d\d\d\d)(.*)', 'i')">
<oaf:projectid>
<xsl:value-of select="concat($varFP7, replace(normalize-space(.), '(info:eu-repo/grantagreement/ec/fp7/)(\d\d\d\d\d\d)(.*)', '$2', 'i'))" />
</oaf:projectid>
</xsl:if>
<xsl:if test="matches(normalize-space(.), '(info:eu-repo/grantagreement/ec/h2020/)(\d\d\d\d\d\d)(.*)', 'i')">
<oaf:projectid>
<xsl:value-of select="concat($varH2020, replace(normalize-space(.), '(info:eu-repo/grantagreement/ec/h2020/)(\d\d\d\d\d\d)(.*)', '$2', 'i'))" />
</oaf:projectid>
</xsl:if>
</xsl:for-each>
<dr:CobjCategory>
<xsl:variable name="varCobjCategory" select="vocabulary:clean($varBaseNormType, 'dnet:publication_resource')" />
<xsl:variable name="varSuperType" select="vocabulary:clean($varCobjCategory, 'dnet:result_typologies')" />
<xsl:attribute name="type" select="$varSuperType" />
<xsl:value-of select="$varCobjCategory" />
</dr:CobjCategory>
<oaf:accessrights>
<xsl:choose>
<xsl:when test="//base_dc:oa[.='1']">OPEN</xsl:when>
<xsl:when test="//base_dc:rightsnorm">
<xsl:value-of select="vocabulary:clean(//base_dc:rightsnorm, 'dnet:access_modes')" />
</xsl:when>
<xsl:when test="//dc:rights">
<xsl:value-of select="vocabulary:clean( //dc:rights, 'dnet:access_modes')" />
</xsl:when>
<xsl:otherwise>UNKNOWN</xsl:otherwise>
</xsl:choose>
</oaf:accessrights>
<xsl:for-each select="//base_dc:doi">
<oaf:identifier identifierType="doi">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each
select="distinct-values(//dc:identifier[starts-with(., 'http') and ( not(contains(., '://dx.doi.org/') or contains(., '://doi.org/') or contains(., '://hdl.handle.net/')))])">
<oaf:identifier identifierType="url">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'http') and contains(., '://hdl.handle.net/')]/substring-after(., 'hdl.handle.net/'))">
<oaf:identifier identifierType="handle">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'urn:nbn:nl:') or starts-with(., 'URN:NBN:NL:')])">
<oaf:identifier identifierType='urn'>
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<oaf:identifier identifierType="oai-original">
<xsl:value-of
select="//*[local-name() = 'about']/*[local-name() = 'provenance']//*[local-name() = 'originDescription' and not(./*[local-name() = 'originDescription'])]/*[local-name() = 'identifier']" />
</oaf:identifier>
<oaf:hostedBy>
<xsl:attribute name="name">
<xsl:value-of select="//base_dc:collname" />
</xsl:attribute>
<xsl:attribute name="id">
<xsl:value-of select="concat('opendoar____::', //base_dc:collection/@opendoar_id)" />
</xsl:attribute>
</oaf:hostedBy>
<oaf:collectedFrom>
<xsl:attribute name="name">
<xsl:value-of select="$varOfficialName" />
</xsl:attribute>
<xsl:attribute name="id">
<xsl:value-of select="$varDataSourceId" />
</xsl:attribute>
</oaf:collectedFrom>
<oaf:dateAccepted>
<xsl:value-of select="dateCleaner:dateISO( //dc:date[1] )" />
</oaf:dateAccepted>
<xsl:if test="//base_dc:oa[.='1']">
<xsl:for-each select="//dc:relation[starts-with(., 'http')]">
<oaf:fulltext>
<xsl:value-of select="normalize-space(.)" />
</oaf:fulltext>
</xsl:for-each>
</xsl:if>
<xsl:for-each select="//base_dc:collection/@ror_id">
<oaf:relation relType="resultOrganization" subRelType="affiliation" relClass="hasAuthorInstitution" targetType="organization">
<xsl:choose>
<xsl:when test="contains(.,'https://ror.org/')">
<xsl:value-of select="concat('ror_________::', normalize-space(.))" />
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="concat('ror_________::https://ror.org/', normalize-space(.))" />
</xsl:otherwise>
</xsl:choose>
</oaf:relation>
</xsl:for-each>
</metadata>
<xsl:copy-of select="//*[local-name() = 'about']" />
</record>
</xsl:template>
<xsl:template match="//*[local-name() = 'header']">
<xsl:if test="//oai:header/@status='deleted'">
<xsl:call-template name="terminate" />
</xsl:if>
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
<xsl:element name="dr:dateOfTransformation">
<xsl:value-of select="$transDate" />
</xsl:element>
</xsl:copy>
</xsl:template>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
</CODE>
</SCRIPT>
</CONFIGURATION>
<STATUS />
<SECURITY_PARAMETERS />
</BODY>
</RESOURCE_PROFILE>


@ -1048,10 +1048,5 @@
"openaire_id": "re3data_____::r3d100010399",
"datacite_name": "ZEW Forschungsdatenzentrum",
"official_name": "ZEW Forschungsdatenzentrum"
},
"HBP.NEUROINF": {
"openaire_id": "fairsharing_::2975",
"datacite_name": "EBRAINS",
"official_name": "EBRAINS"
}
}


@ -1,4 +1,4 @@
<workflow-app name="Transform_BioEntity_Workflow" xmlns="uri:oozie:workflow:0.5">
<workflow-app name="Transform_BioEntity_Workflow" xmlns="uri:oozie:workflow:0.5">
<parameters>
<property>
<name>sourcePath</name>
@ -8,40 +8,19 @@
<name>database</name>
<description>the PDB Database Working Path</description>
</property>
<property>
<name>mdStoreOutputId</name>
<description>the identifier of the cleaned MDStore</description>
</property>
<property>
<name>mdStoreManagerURI</name>
<description>the path of the cleaned mdstore</description>
<name>targetPath</name>
<description>the Target Working dir path</description>
</property>
</parameters>
<start to="StartTransaction"/>
<start to="ConvertDB"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="StartTransaction">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>NEW_VERSION</arg>
<arg>--mdStoreID</arg><arg>${mdStoreOutputId}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
<capture-output/>
</java>
<ok to="ConvertDB"/>
<error to="RollBack"/>
</action>
<action name="ConvertDB">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
@ -62,48 +41,11 @@
<arg>--master</arg><arg>yarn</arg>
<arg>--dbPath</arg><arg>${sourcePath}</arg>
<arg>--database</arg><arg>${database}</arg>
<arg>--mdstoreOutputVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--targetPath</arg><arg>${targetPath}</arg>
</spark>
<ok to="CommitVersion"/>
<error to="RollBack"/>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="CommitVersion">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>COMMIT</arg>
<arg>--namenode</arg><arg>${nameNode}</arg>
<arg>--mdStoreVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
</java>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="RollBack">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>ROLLBACK</arg>
<arg>--mdStoreVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
</java>
<ok to="Kill"/>
<error to="Kill"/>
</action>
<end name="End"/>
<end name="End"/>
</workflow-app>

Some files were not shown because too many files have changed in this diff.