1
0
Fork 0

merged beta

This commit is contained in:
Sandro La Bruzzo 2024-04-19 14:39:59 +02:00
commit b84ad0c06e
470 changed files with 140235 additions and 6184 deletions

43
CODE_OF_CONDUCT.md Normal file
View File

@ -0,0 +1,43 @@
# Contributor Code of Conduct
Openness, transparency and our community-driven participatory approach guide us in our day-to-day interactions and decision-making. Our open source projects are no exception. Trust, respect, collaboration and transparency are core values we believe should live and breathe within our projects. Our community welcomes participants from around the world with different experiences, unique perspectives, and great ideas to share.
## Our Pledge
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment include:
- Using welcoming and inclusive language
- Being respectful of differing viewpoints and experiences
- Gracefully accepting constructive criticism
- Attempting collaboration before conflict
- Focusing on what is best for the community
- Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
- Violence, threats of violence, or inciting others to commit self-harm
- The use of sexualized language or imagery and unwelcome sexual attention or advances
- Trolling, intentionally spreading misinformation, insulting/derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or electronic address, without explicit permission
- Abuse of the reporting process to intentionally harass or exclude others
- Advocating for, or encouraging, any of the above behavior
- Other conduct which could reasonably be considered inappropriate in a professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/), [version 1.4](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html).

10
CONTRIBUTING.md Normal file
View File

@ -0,0 +1,10 @@
# Contributing to D-Net Hadoop
:+1::tada: First off, thanks for taking the time to contribute! :tada::+1:
This project and everyone participating in it is governed by our [Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
The following is a set of guidelines for contributing to this project and its packages. These are mostly guidelines, not rules, which applies to this project as a while, including all its sub-modules.
Use your best judgment, and feel free to propose changes to this document in a pull request.
All contributions are welcome, all contributions will be considered to be contributed under the [project license](LICENSE.md).

View File

View File

@ -2,6 +2,11 @@
Dnet-hadoop is the project that defined all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing, provisioning. Dnet-hadoop is the project that defined all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing, provisioning.
This project adheres to the Contributor Covenant [code of conduct](CODE_OF_CONDUCT.md).
By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
This project is licensed under the [AGPL v3 or later version](#LICENSE.md).
How to build, package and run oozie workflows How to build, package and run oozie workflows
==================== ====================

View File

@ -217,8 +217,6 @@ public class ZenodoAPIClient implements Serializable {
* part of the url for the DOI Zenodo suggests to use to cite all versions: DOI: 10.xxx/zenodo.656930 * part of the url for the DOI Zenodo suggests to use to cite all versions: DOI: 10.xxx/zenodo.656930
* concept_rec_id = 656930 * concept_rec_id = 656930
* @return response code * @return response code
* @throws IOException
* @throws MissingConceptDoiException
*/ */
public int newVersion(String concept_rec_id) throws IOException, MissingConceptDoiException { public int newVersion(String concept_rec_id) throws IOException, MissingConceptDoiException {
setDepositionId(concept_rec_id, 1); setDepositionId(concept_rec_id, 1);

View File

@ -0,0 +1,39 @@
package eu.dnetlib.dhp.common.api.context;
public class CategorySummary {
private String id;
private String label;
private boolean hasConcept;
public String getId() {
return id;
}
public String getLabel() {
return label;
}
public boolean isHasConcept() {
return hasConcept;
}
public CategorySummary setId(final String id) {
this.id = id;
return this;
}
public CategorySummary setLabel(final String label) {
this.label = label;
return this;
}
public CategorySummary setHasConcept(final boolean hasConcept) {
this.hasConcept = hasConcept;
return this;
}
}

View File

@ -0,0 +1,7 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class CategorySummaryList extends ArrayList<CategorySummary> {
}

View File

@ -0,0 +1,52 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.List;
public class ConceptSummary {
private String id;
private String label;
public boolean hasSubConcept;
private List<ConceptSummary> concepts;
public String getId() {
return id;
}
public String getLabel() {
return label;
}
public List<ConceptSummary> getConcepts() {
return concepts;
}
public ConceptSummary setId(final String id) {
this.id = id;
return this;
}
public ConceptSummary setLabel(final String label) {
this.label = label;
return this;
}
public boolean isHasSubConcept() {
return hasSubConcept;
}
public ConceptSummary setHasSubConcept(final boolean hasSubConcept) {
this.hasSubConcept = hasSubConcept;
return this;
}
public ConceptSummary setConcept(final List<ConceptSummary> concepts) {
this.concepts = concepts;
return this;
}
}

View File

@ -0,0 +1,7 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class ConceptSummaryList extends ArrayList<ConceptSummary> {
}

View File

@ -0,0 +1,50 @@
package eu.dnetlib.dhp.common.api.context;
public class ContextSummary {
private String id;
private String label;
private String type;
private String status;
public String getId() {
return id;
}
public String getLabel() {
return label;
}
public String getType() {
return type;
}
public String getStatus() {
return status;
}
public ContextSummary setId(final String id) {
this.id = id;
return this;
}
public ContextSummary setLabel(final String label) {
this.label = label;
return this;
}
public ContextSummary setType(final String type) {
this.type = type;
return this;
}
public ContextSummary setStatus(final String status) {
this.status = status;
return this;
}
}

View File

@ -0,0 +1,7 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class ContextSummaryList extends ArrayList<ContextSummary> {
}

View File

@ -8,6 +8,7 @@ import java.io.InputStream;
import java.net.*; import java.net.*;
import java.util.List; import java.util.List;
import java.util.Map; import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.commons.io.IOUtils; import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.math.NumberUtils; import org.apache.commons.lang3.math.NumberUtils;
@ -94,14 +95,16 @@ public class HttpConnector2 {
throw new CollectorException(msg); throw new CollectorException(msg);
} }
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
InputStream input = null; InputStream input = null;
long start = System.currentTimeMillis();
try { try {
if (getClientParams().getRequestDelay() > 0) { if (getClientParams().getRequestDelay() > 0) {
backoffAndSleep(getClientParams().getRequestDelay()); backoffAndSleep(getClientParams().getRequestDelay());
} }
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
final HttpURLConnection urlConn = (HttpURLConnection) new URL(requestUrl).openConnection(); final HttpURLConnection urlConn = (HttpURLConnection) new URL(requestUrl).openConnection();
urlConn.setInstanceFollowRedirects(false); urlConn.setInstanceFollowRedirects(false);
urlConn.setReadTimeout(getClientParams().getReadTimeOut() * 1000); urlConn.setReadTimeout(getClientParams().getReadTimeOut() * 1000);
@ -115,9 +118,8 @@ public class HttpConnector2 {
urlConn.addRequestProperty(headerEntry.getKey(), headerEntry.getValue()); urlConn.addRequestProperty(headerEntry.getKey(), headerEntry.getValue());
} }
} }
if (log.isDebugEnabled()) {
logHeaderFields(urlConn); logHeaderFields(urlConn);
}
int retryAfter = obtainRetryAfter(urlConn.getHeaderFields()); int retryAfter = obtainRetryAfter(urlConn.getHeaderFields());
String rateLimit = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT); String rateLimit = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT);
@ -132,9 +134,7 @@ public class HttpConnector2 {
} }
if (is2xx(urlConn.getResponseCode())) { if (is2xx(urlConn.getResponseCode())) {
input = urlConn.getInputStream(); return getInputStream(urlConn, start);
responseType = urlConn.getContentType();
return input;
} }
if (is3xx(urlConn.getResponseCode())) { if (is3xx(urlConn.getResponseCode())) {
// REDIRECTS // REDIRECTS
@ -144,6 +144,7 @@ public class HttpConnector2 {
.put( .put(
REPORT_PREFIX + urlConn.getResponseCode(), REPORT_PREFIX + urlConn.getResponseCode(),
String.format("Moved to: %s", newUrl)); String.format("Moved to: %s", newUrl));
logRequestTime(start);
urlConn.disconnect(); urlConn.disconnect();
if (retryAfter > 0) { if (retryAfter > 0) {
backoffAndSleep(retryAfter); backoffAndSleep(retryAfter);
@ -159,26 +160,50 @@ public class HttpConnector2 {
if (retryAfter > 0) { if (retryAfter > 0) {
log log
.warn( .warn(
"{} - waiting and repeating request after suggested retry-after {} sec.", "waiting and repeating request after suggested retry-after {} sec for URL {}",
requestUrl, retryAfter); retryAfter, requestUrl);
backoffAndSleep(retryAfter * 1000); backoffAndSleep(retryAfter * 1000);
} else { } else {
log log
.warn( .warn(
"{} - waiting and repeating request after default delay of {} sec.", "waiting and repeating request after default delay of {} sec for URL {}",
requestUrl, getClientParams().getRetryDelay()); getClientParams().getRetryDelay(), requestUrl);
backoffAndSleep(retryNumber * getClientParams().getRetryDelay() * 1000); backoffAndSleep(retryNumber * getClientParams().getRetryDelay());
} }
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl); report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
logRequestTime(start);
urlConn.disconnect(); urlConn.disconnect();
return attemptDownload(requestUrl, retryNumber + 1, report); return attemptDownload(requestUrl, retryNumber + 1, report);
case 422: // UNPROCESSABLE ENTITY
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
log.warn("waiting and repeating request after 10 sec for URL {}", requestUrl);
backoffAndSleep(10000);
urlConn.disconnect();
logRequestTime(start);
try {
return getInputStream(urlConn, start);
} catch (IOException e) {
log
.error(
"server returned 422 and got IOException accessing the response body from URL {}",
requestUrl);
log.error("IOException:", e);
return attemptDownload(requestUrl, retryNumber + 1, report);
}
default: default:
log.error("gor error {} from URL: {}", urlConn.getResponseCode(), urlConn.getURL());
log.error("response message: {}", urlConn.getResponseMessage());
report report
.put( .put(
REPORT_PREFIX + urlConn.getResponseCode(), REPORT_PREFIX + urlConn.getResponseCode(),
String String
.format( .format(
"%s Error: %s", requestUrl, urlConn.getResponseMessage())); "%s Error: %s", requestUrl, urlConn.getResponseMessage()));
logRequestTime(start);
urlConn.disconnect();
throw new CollectorException(urlConn.getResponseCode() + " error " + report); throw new CollectorException(urlConn.getResponseCode() + " error " + report);
} }
} }
@ -199,13 +224,27 @@ public class HttpConnector2 {
} }
} }
private InputStream getInputStream(HttpURLConnection urlConn, long start) throws IOException {
InputStream input = urlConn.getInputStream();
responseType = urlConn.getContentType();
logRequestTime(start);
return input;
}
private static void logRequestTime(long start) {
log
.info(
"request time elapsed: {}sec",
TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis() - start));
}
private void logHeaderFields(final HttpURLConnection urlConn) throws IOException { private void logHeaderFields(final HttpURLConnection urlConn) throws IOException {
log.debug("StatusCode: {}", urlConn.getResponseMessage()); log.info("Response: {} - {}", urlConn.getResponseCode(), urlConn.getResponseMessage());
for (Map.Entry<String, List<String>> e : urlConn.getHeaderFields().entrySet()) { for (Map.Entry<String, List<String>> e : urlConn.getHeaderFields().entrySet()) {
if (e.getKey() != null) { if (e.getKey() != null) {
for (String v : e.getValue()) { for (String v : e.getValue()) {
log.debug(" key: {} - value: {}", e.getKey(), v); log.info(" key: {} - value: {}", e.getKey(), v);
} }
} }
} }
@ -225,7 +264,7 @@ public class HttpConnector2 {
for (String key : headerMap.keySet()) { for (String key : headerMap.keySet()) {
if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (!headerMap.get(key).isEmpty()) if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (!headerMap.get(key).isEmpty())
&& NumberUtils.isCreatable(headerMap.get(key).get(0))) { && NumberUtils.isCreatable(headerMap.get(key).get(0))) {
return Integer.parseInt(headerMap.get(key).get(0)) + 10; return Integer.parseInt(headerMap.get(key).get(0));
} }
} }
return -1; return -1;

View File

@ -145,105 +145,6 @@ public class AuthorMerger {
return null; return null;
} }
/**
* This method tries to figure out when two author are the same in the contest
* of ORCID enrichment
*
* @param left Author in the OAF entity
* @param right Author ORCID
* @return based on a heuristic on the names of the authors if they are the same.
*/
public static boolean checkORCIDSimilarity(final Author left, final Author right) {
final Person pl = parse(left);
final Person pr = parse(right);
// If one of them didn't have a surname we verify if they have the fullName not empty
// and verify if the normalized version is equal
if (!(pl.getSurname() != null && pl.getSurname().stream().anyMatch(StringUtils::isNotBlank) &&
pr.getSurname() != null && pr.getSurname().stream().anyMatch(StringUtils::isNotBlank))) {
if (pl.getFullname() != null && !pl.getFullname().isEmpty() && pr.getFullname() != null
&& !pr.getFullname().isEmpty()) {
return pl
.getFullname()
.stream()
.anyMatch(
fl -> pr.getFullname().stream().anyMatch(fr -> normalize(fl).equalsIgnoreCase(normalize(fr))));
} else {
return false;
}
}
// The Authors have one surname in common
if (pl.getSurname().stream().anyMatch(sl -> pr.getSurname().stream().anyMatch(sr -> sr.equalsIgnoreCase(sl)))) {
// If one of them has only a surname and is the same we can say that they are the same author
if ((pl.getName() == null || pl.getName().stream().allMatch(StringUtils::isBlank)) ||
(pr.getName() == null || pr.getName().stream().allMatch(StringUtils::isBlank)))
return true;
// The authors have the same initials of Name in common
if (pl
.getName()
.stream()
.anyMatch(
nl -> pr
.getName()
.stream()
.anyMatch(nr -> nr.equalsIgnoreCase(nl))))
return true;
}
// Sometimes we noticed that publication have author wrote in inverse order Surname, Name
// We verify if we have an exact match between name and surname
if (pl.getSurname().stream().anyMatch(sl -> pr.getName().stream().anyMatch(nr -> nr.equalsIgnoreCase(sl))) &&
pl.getName().stream().anyMatch(nl -> pr.getSurname().stream().anyMatch(sr -> sr.equalsIgnoreCase(nl))))
return true;
else
return false;
}
//
/**
* Method to enrich ORCID information in one list of authors based on another list
*
* @param baseAuthor the Author List in the OAF Entity
* @param orcidAuthor The list of ORCID Author intersected
* @return The Author List of the OAF Entity enriched with the orcid Author
*/
public static List<Author> enrichOrcid(List<Author> baseAuthor, List<Author> orcidAuthor) {
if (baseAuthor == null || baseAuthor.isEmpty())
return orcidAuthor;
if (orcidAuthor == null || orcidAuthor.isEmpty())
return baseAuthor;
if (baseAuthor.size() == 1 && orcidAuthor.size() > 10)
return baseAuthor;
final List<Author> oAuthor = new ArrayList<>();
oAuthor.addAll(orcidAuthor);
baseAuthor.forEach(ba -> {
Optional<Author> aMatch = oAuthor.stream().filter(oa -> checkORCIDSimilarity(ba, oa)).findFirst();
if (aMatch.isPresent()) {
final Author sameAuthor = aMatch.get();
addPid(ba, sameAuthor.getPid());
oAuthor.remove(sameAuthor);
}
});
return baseAuthor;
}
private static void addPid(final Author a, final List<StructuredProperty> pids) {
if (a.getPid() == null) {
a.setPid(new ArrayList<>());
}
a.getPid().addAll(pids);
}
public static String pidToComparableString(StructuredProperty pid) { public static String pidToComparableString(StructuredProperty pid) {
final String classid = pid.getQualifier().getClassid() != null ? pid.getQualifier().getClassid().toLowerCase() final String classid = pid.getQualifier().getClassid() != null ? pid.getQualifier().getClassid().toLowerCase()
: ""; : "";

View File

@ -14,7 +14,7 @@ import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils; import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf; import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.MapFunction; import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction; import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.*; import org.apache.spark.sql.*;
import org.slf4j.Logger; import org.slf4j.Logger;
import org.slf4j.LoggerFactory; import org.slf4j.LoggerFactory;
@ -26,7 +26,7 @@ import eu.dnetlib.dhp.schema.common.EntityType;
import eu.dnetlib.dhp.schema.common.ModelSupport; import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.OafEntity; import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.oaf.utils.GraphCleaningFunctions; import eu.dnetlib.dhp.schema.oaf.utils.GraphCleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils; import eu.dnetlib.dhp.schema.oaf.utils.MergeUtils;
import eu.dnetlib.dhp.utils.ISLookupClientFactory; import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException; import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService; import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
@ -135,10 +135,10 @@ public class GroupEntitiesSparkJob {
.applyCoarVocabularies(entity, vocs), .applyCoarVocabularies(entity, vocs),
OAFENTITY_KRYO_ENC) OAFENTITY_KRYO_ENC)
.groupByKey((MapFunction<OafEntity, String>) OafEntity::getId, Encoders.STRING()) .groupByKey((MapFunction<OafEntity, String>) OafEntity::getId, Encoders.STRING())
.reduceGroups((ReduceFunction<OafEntity>) OafMapperUtils::mergeEntities) .mapGroups((MapGroupsFunction<String, OafEntity, OafEntity>) MergeUtils::mergeGroup, OAFENTITY_KRYO_ENC)
.map( .map(
(MapFunction<Tuple2<String, OafEntity>, Tuple2<String, OafEntity>>) t -> new Tuple2<>( (MapFunction<OafEntity, Tuple2<String, OafEntity>>) t -> new Tuple2<>(
t._2().getClass().getName(), t._2()), t.getClass().getName(), t),
Encoders.tuple(Encoders.STRING(), OAFENTITY_KRYO_ENC)); Encoders.tuple(Encoders.STRING(), OAFENTITY_KRYO_ENC));
// pivot on "_1" (classname of the entity) // pivot on "_1" (classname of the entity)

View File

@ -0,0 +1,77 @@
package eu.dnetlib.dhp.oozie;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkHiveSession;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import org.apache.commons.lang3.time.DurationFormatUtils;
import org.apache.commons.text.StringSubstitutor;
import org.apache.spark.SparkConf;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.common.io.Resources;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
public class RunSQLSparkJob {
private static final Logger log = LoggerFactory.getLogger(RunSQLSparkJob.class);
private final ArgumentApplicationParser parser;
public RunSQLSparkJob(ArgumentApplicationParser parser) {
this.parser = parser;
}
public static void main(String[] args) throws Exception {
Map<String, String> params = new HashMap<>();
for (int i = 0; i < args.length - 1; i++) {
if (args[i].startsWith("--")) {
params.put(args[i].substring(2), args[++i]);
}
}
/*
* String jsonConfiguration = IOUtils .toString( Objects .requireNonNull( RunSQLSparkJob.class
* .getResourceAsStream( "/eu/dnetlib/dhp/oozie/run_sql_parameters.json"))); final ArgumentApplicationParser
* parser = new ArgumentApplicationParser(jsonConfiguration); parser.parseArgument(args);
*/
Boolean isSparkSessionManaged = Optional
.ofNullable(params.get("isSparkSessionManaged"))
.map(Boolean::valueOf)
.orElse(Boolean.TRUE);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
URL url = com.google.common.io.Resources.getResource(params.get("sql"));
String raw_sql = Resources.toString(url, StandardCharsets.UTF_8);
String sql = StringSubstitutor.replace(raw_sql, params);
log.info("sql: {}", sql);
SparkConf conf = new SparkConf();
conf.set("hive.metastore.uris", params.get("hiveMetastoreUris"));
runWithSparkHiveSession(
conf,
isSparkSessionManaged,
spark -> {
for (String statement : sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*")) {
log.info("executing: {}", statement);
long startTime = System.currentTimeMillis();
spark.sql(statement).show();
log
.info(
"executed in {}",
DurationFormatUtils.formatDuration(System.currentTimeMillis() - startTime, "HH:mm:ss.S"));
}
});
}
}

View File

@ -0,0 +1,76 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.HashSet;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;
import org.apache.commons.lang3.StringUtils;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
public class CleaningFunctions {
public static final String DOI_PREFIX_REGEX = "(^10\\.|\\/10\\.)";
public static final String DOI_PREFIX = "10.";
public static final Set<String> PID_BLACKLIST = new HashSet<>();
static {
PID_BLACKLIST.add("none");
PID_BLACKLIST.add("na");
}
public CleaningFunctions() {
}
/**
* Utility method that filter PID values on a per-type basis.
* @param s the PID whose value will be checked.
* @return false if the pid matches the filter criteria, true otherwise.
*/
public static boolean pidFilter(StructuredProperty s) {
final String pidValue = s.getValue();
if (Objects.isNull(s.getQualifier()) ||
StringUtils.isBlank(pidValue) ||
StringUtils.isBlank(pidValue.replaceAll("(?:\\n|\\r|\\t|\\s)", ""))) {
return false;
}
if (CleaningFunctions.PID_BLACKLIST.contains(pidValue)) {
return false;
}
return !PidBlacklistProvider.getBlacklist(s.getQualifier().getClassid()).contains(pidValue);
}
/**
* Utility method that normalises PID values on a per-type basis.
* @param pid the PID whose value will be normalised.
* @return the PID containing the normalised value.
*/
public static StructuredProperty normalizePidValue(StructuredProperty pid) {
pid
.setValue(
normalizePidValue(
pid.getQualifier().getClassid(),
pid.getValue()));
return pid;
}
public static String normalizePidValue(String pidType, String pidValue) {
String value = Optional
.ofNullable(pidValue)
.map(String::trim)
.orElseThrow(() -> new IllegalArgumentException("PID value cannot be empty"));
switch (pidType) {
// TODO add cleaning for more PID types as needed
case "doi":
return value.toLowerCase().replaceFirst(DOI_PREFIX_REGEX, DOI_PREFIX);
}
return value;
}
}

View File

@ -1,6 +1,8 @@
package eu.dnetlib.dhp.schema.oaf.utils; package eu.dnetlib.dhp.schema.oaf.utils;
import org.apache.commons.lang3.StringUtils;
public class DoiCleaningRule { public class DoiCleaningRule {
public static String clean(final String doi) { public static String clean(final String doi) {
@ -11,4 +13,26 @@ public class DoiCleaningRule {
.replaceFirst(CleaningFunctions.DOI_PREFIX_REGEX, CleaningFunctions.DOI_PREFIX); .replaceFirst(CleaningFunctions.DOI_PREFIX_REGEX, CleaningFunctions.DOI_PREFIX);
} }
public static String normalizeDoi(final String input) {
if (input == null)
return null;
final String replaced = input
.replaceAll("\\n|\\r|\\t|\\s", "")
.toLowerCase()
.replaceFirst(CleaningFunctions.DOI_PREFIX_REGEX, CleaningFunctions.DOI_PREFIX);
if (StringUtils.isEmpty(replaced))
return null;
if (!replaced.contains("10."))
return null;
final String ret = replaced.substring(replaced.indexOf("10."));
if (!ret.startsWith(CleaningFunctions.DOI_PREFIX))
return null;
return ret;
}
} }

View File

@ -23,6 +23,7 @@ import com.google.common.collect.Lists;
import com.google.common.collect.Sets; import com.google.common.collect.Sets;
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup; import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup;
import eu.dnetlib.dhp.common.vocabulary.VocabularyTerm;
import eu.dnetlib.dhp.schema.common.ModelConstants; import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport; import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*; import eu.dnetlib.dhp.schema.oaf.*;
@ -30,6 +31,10 @@ import me.xuender.unidecode.Unidecode;
public class GraphCleaningFunctions extends CleaningFunctions { public class GraphCleaningFunctions extends CleaningFunctions {
public static final String DNET_PUBLISHERS = "dnet:publishers";
public static final String DNET_LICENSES = "dnet:licenses";
public static final String ORCID_CLEANING_REGEX = ".*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9x]{4})"; public static final String ORCID_CLEANING_REGEX = ".*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9x]{4})";
public static final int ORCID_LEN = 19; public static final int ORCID_LEN = 19;
public static final String CLEANING_REGEX = "(?:\\n|\\r|\\t)"; public static final String CLEANING_REGEX = "(?:\\n|\\r|\\t)";
@ -87,6 +92,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
INVALID_AUTHOR_NAMES.add("null anonymous"); INVALID_AUTHOR_NAMES.add("null anonymous");
INVALID_AUTHOR_NAMES.add("unbekannt"); INVALID_AUTHOR_NAMES.add("unbekannt");
INVALID_AUTHOR_NAMES.add("unknown"); INVALID_AUTHOR_NAMES.add("unknown");
INVALID_AUTHOR_NAMES.add("autor, Sin");
INVALID_AUTHOR_NAMES.add("Desconocido / Inconnu,");
INVALID_URL_HOSTS.add("creativecommons.org"); INVALID_URL_HOSTS.add("creativecommons.org");
INVALID_URL_HOSTS.add("www.academia.edu"); INVALID_URL_HOSTS.add("www.academia.edu");
@ -307,7 +314,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
} }
if (value instanceof Datasource) { if (value instanceof Datasource) {
// nothing to evaluate here final Datasource d = (Datasource) value;
return Objects.nonNull(d.getOfficialname()) && StringUtils.isNotBlank(d.getOfficialname().getValue());
} else if (value instanceof Project) { } else if (value instanceof Project) {
final Project p = (Project) value; final Project p = (Project) value;
return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue()); return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
@ -409,6 +417,14 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.getPublisher() .getPublisher()
.getValue() .getValue()
.replaceAll(NAME_CLEANING_REGEX, " ")); .replaceAll(NAME_CLEANING_REGEX, " "));
if (vocs.vocabularyExists(DNET_PUBLISHERS)) {
vocs
.find(DNET_PUBLISHERS)
.map(voc -> voc.getTermBySynonym(r.getPublisher().getValue()))
.map(VocabularyTerm::getName)
.ifPresent(publisher -> r.getPublisher().setValue(publisher));
}
} }
} }
if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) { if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) {
@ -492,6 +508,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(Objects::nonNull) .filter(Objects::nonNull)
.filter(sp -> StringUtils.isNotBlank(sp.getValue())) .filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.map(GraphCleaningFunctions::cleanValue) .map(GraphCleaningFunctions::cleanValue)
.sorted((s1, s2) -> s2.getValue().length() - s1.getValue().length())
.limit(ModelHardLimits.MAX_ABSTRACTS)
.collect(Collectors.toList())); .collect(Collectors.toList()));
} }
if (Objects.isNull(r.getResourcetype()) || StringUtils.isBlank(r.getResourcetype().getClassid())) { if (Objects.isNull(r.getResourcetype()) || StringUtils.isBlank(r.getResourcetype().getClassid())) {
@ -569,6 +587,14 @@ public class GraphCleaningFunctions extends CleaningFunctions {
i.setRefereed(qualifier("0000", "Unknown", ModelConstants.DNET_REVIEW_LEVELS)); i.setRefereed(qualifier("0000", "Unknown", ModelConstants.DNET_REVIEW_LEVELS));
} }
if (Objects.nonNull(i.getLicense()) && Objects.nonNull(i.getLicense().getValue())) {
vocs
.find(DNET_LICENSES)
.map(voc -> voc.getTermBySynonym(i.getLicense().getValue()))
.map(VocabularyTerm::getId)
.ifPresent(license -> i.getLicense().setValue(license));
}
// from the script from Dimitris // from the script from Dimitris
if ("0000".equals(i.getRefereed().getClassid())) { if ("0000".equals(i.getRefereed().getClassid())) {
final boolean isFromCrossref = Optional final boolean isFromCrossref = Optional
@ -668,6 +694,9 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(Objects::nonNull) .filter(Objects::nonNull)
.filter(p -> Objects.nonNull(p.getQualifier())) .filter(p -> Objects.nonNull(p.getQualifier()))
.filter(p -> StringUtils.isNotBlank(p.getValue())) .filter(p -> StringUtils.isNotBlank(p.getValue()))
.filter(
p -> StringUtils
.contains(StringUtils.lowerCase(p.getQualifier().getClassid()), ORCID))
.map(p -> { .map(p -> {
// hack to distinguish orcid from orcid_pending // hack to distinguish orcid from orcid_pending
String pidProvenance = getProvenance(p.getDataInfo()); String pidProvenance = getProvenance(p.getDataInfo());
@ -677,7 +706,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.toLowerCase() .toLowerCase()
.contains(ModelConstants.ORCID)) { .contains(ModelConstants.ORCID)) {
if (pidProvenance if (pidProvenance
.equals(ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY)) { .equals(ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY) ||
pidProvenance.equals("ORCID_ENRICHMENT")) {
p.getQualifier().setClassid(ModelConstants.ORCID); p.getQualifier().setClassid(ModelConstants.ORCID);
} else { } else {
p.getQualifier().setClassid(ModelConstants.ORCID_PENDING); p.getQualifier().setClassid(ModelConstants.ORCID_PENDING);

View File

@ -0,0 +1,294 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static com.google.common.base.Preconditions.checkArgument;
import static eu.dnetlib.dhp.schema.common.ModelConstants.*;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.apache.commons.codec.binary.Hex;
import org.apache.commons.lang3.StringUtils;
import com.google.common.collect.HashBiMap;
import com.google.common.collect.Maps;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*;
/**
* Factory class for OpenAIRE identifiers in the Graph
*/
public class IdentifierFactory implements Serializable {
public static final String ID_SEPARATOR = "::";
public static final String ID_PREFIX_SEPARATOR = "|";
public static final int ID_PREFIX_LEN = 12;
/**
* Declares the associations PID_TYPE -> [DATASOURCE ID, NAME] considered authoritative for that PID_TYPE.
* The id of the record (source_::id) will be rewritten as pidType_::id)
*/
public static final Map<PidType, HashBiMap<String, String>> PID_AUTHORITY = Maps.newHashMap();
static {
PID_AUTHORITY.put(PidType.doi, HashBiMap.create());
PID_AUTHORITY.get(PidType.doi).put(CROSSREF_ID, "Crossref");
PID_AUTHORITY.get(PidType.doi).put(DATACITE_ID, "Datacite");
PID_AUTHORITY.get(PidType.doi).put(ZENODO_OD_ID, "ZENODO");
PID_AUTHORITY.get(PidType.doi).put(ZENODO_R3_ID, "Zenodo");
PID_AUTHORITY.put(PidType.pmc, HashBiMap.create());
PID_AUTHORITY.get(PidType.pmc).put(EUROPE_PUBMED_CENTRAL_ID, "Europe PubMed Central");
PID_AUTHORITY.get(PidType.pmc).put(PUBMED_CENTRAL_ID, "PubMed Central");
PID_AUTHORITY.put(PidType.pmid, HashBiMap.create());
PID_AUTHORITY.get(PidType.pmid).put(EUROPE_PUBMED_CENTRAL_ID, "Europe PubMed Central");
PID_AUTHORITY.get(PidType.pmid).put(PUBMED_CENTRAL_ID, "PubMed Central");
PID_AUTHORITY.put(PidType.arXiv, HashBiMap.create());
PID_AUTHORITY.get(PidType.arXiv).put(ARXIV_ID, "arXiv.org e-Print Archive");
PID_AUTHORITY.put(PidType.w3id, HashBiMap.create());
PID_AUTHORITY.get(PidType.w3id).put(ROHUB_ID, "ROHub");
}
/**
* Declares the associations PID_TYPE -> [DATASOURCE ID, PID SUBSTRING] considered as delegated authority for that
* PID_TYPE. Example, Zenodo is delegated to forge DOIs that contain the 'zenodo' word.
*
* If a record with the same id (same pid) comes from 2 data sources, the one coming from a delegated source wins. E.g. Zenodo records win over those from Datacite.
* See also https://code-repo.d4science.org/D-Net/dnet-hadoop/pulls/187 and the class dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/OafMapperUtils.java
*/
public static final Map<PidType, Map<String, String>> DELEGATED_PID_AUTHORITY = Maps.newHashMap();
static {
DELEGATED_PID_AUTHORITY.put(PidType.doi, new HashMap<>());
DELEGATED_PID_AUTHORITY.get(PidType.doi).put(ZENODO_OD_ID, "zenodo");
DELEGATED_PID_AUTHORITY.get(PidType.doi).put(ZENODO_R3_ID, "zenodo");
DELEGATED_PID_AUTHORITY.put(PidType.w3id, new HashMap<>());
DELEGATED_PID_AUTHORITY.get(PidType.w3id).put(ROHUB_ID, "ro-id");
}
/**
* Declares the associations PID_TYPE -> [DATASOURCE ID, NAME] whose records are considered enrichment for the graph.
* Their OpenAIRE ID is built from the declared PID type. Are merged with their corresponding record, identified by
* the same OpenAIRE id.
*/
public static final Map<PidType, HashBiMap<String, String>> ENRICHMENT_PROVIDER = Maps.newHashMap();
static {
ENRICHMENT_PROVIDER.put(PidType.doi, HashBiMap.create());
ENRICHMENT_PROVIDER.get(PidType.doi).put(OPEN_APC_ID, OPEN_APC_NAME);
}
public static Set<String> delegatedAuthorityDatasourceIds() {
return DELEGATED_PID_AUTHORITY
.values()
.stream()
.flatMap(m -> m.keySet().stream())
.collect(Collectors.toCollection(HashSet::new));
}
public static List<StructuredProperty> getPids(List<StructuredProperty> pid, KeyValue collectedFrom) {
return pidFromInstance(pid, collectedFrom, true).distinct().collect(Collectors.toList());
}
public static <T extends Result> String createDOIBoostIdentifier(T entity) {
if (entity == null)
return null;
StructuredProperty pid = null;
if (entity.getPid() != null) {
pid = entity
.getPid()
.stream()
.filter(Objects::nonNull)
.filter(s -> s.getQualifier() != null && "doi".equalsIgnoreCase(s.getQualifier().getClassid()))
.filter(CleaningFunctions::pidFilter)
.findAny()
.orElse(null);
} else {
if (entity.getInstance() != null) {
pid = entity
.getInstance()
.stream()
.filter(i -> i.getPid() != null)
.flatMap(i -> i.getPid().stream())
.filter(CleaningFunctions::pidFilter)
.findAny()
.orElse(null);
}
}
if (pid != null)
return idFromPid(entity, pid, true);
return null;
}
/**
* Creates an identifier from the most relevant PID (if available) provided by a known PID authority in the given
* entity T. Returns entity.id when none of the PIDs meet the selection criteria is available.
*
* @param entity the entity providing PIDs and a default ID.
* @param <T> the specific entity type. Currently Organization and Result subclasses are supported.
* @param md5 indicates whether should hash the PID value or not.
* @return an identifier from the most relevant PID, entity.id otherwise
*/
public static <T extends OafEntity> String createIdentifier(T entity, boolean md5) {
checkArgument(StringUtils.isNoneBlank(entity.getId()), "missing entity identifier");
final Map<String, Set<StructuredProperty>> pids = extractPids(entity);
return pids
.values()
.stream()
.flatMap(Set::stream)
.min(new PidComparator<>(entity))
.map(
min -> Optional
.ofNullable(pids.get(min.getQualifier().getClassid()))
.map(
p -> p
.stream()
.sorted(new PidValueComparator())
.findFirst()
.map(s -> idFromPid(entity, s, md5))
.orElseGet(entity::getId))
.orElseGet(entity::getId))
.orElseGet(entity::getId);
}
private static <T extends OafEntity> Map<String, Set<StructuredProperty>> extractPids(T entity) {
if (entity instanceof Result) {
return Optional
.ofNullable(((Result) entity).getInstance())
.map(IdentifierFactory::mapPids)
.orElse(new HashMap<>());
} else {
return entity
.getPid()
.stream()
.map(CleaningFunctions::normalizePidValue)
.filter(CleaningFunctions::pidFilter)
.collect(
Collectors
.groupingBy(
p -> p.getQualifier().getClassid(),
Collectors.mapping(p -> p, Collectors.toCollection(HashSet::new))));
}
}
private static Map<String, Set<StructuredProperty>> mapPids(List<Instance> instance) {
return instance
.stream()
.map(i -> pidFromInstance(i.getPid(), i.getCollectedfrom(), false))
.flatMap(Function.identity())
.collect(
Collectors
.groupingBy(
p -> p.getQualifier().getClassid(),
Collectors.mapping(p -> p, Collectors.toCollection(HashSet::new))));
}
private static Stream<StructuredProperty> pidFromInstance(List<StructuredProperty> pid, KeyValue collectedFrom,
boolean mapHandles) {
return Optional
.ofNullable(pid)
.map(
pp -> pp
.stream()
// filter away PIDs provided by a DS that is not considered an authority for the
// given PID Type
.filter(p -> shouldFilterPidByCriteria(collectedFrom, p, mapHandles))
.map(CleaningFunctions::normalizePidValue)
.filter(p -> isNotFromDelegatedAuthority(collectedFrom, p))
.filter(CleaningFunctions::pidFilter))
.orElse(Stream.empty());
}
private static boolean shouldFilterPidByCriteria(KeyValue collectedFrom, StructuredProperty p, boolean mapHandles) {
final PidType pType = PidType.tryValueOf(p.getQualifier().getClassid());
if (Objects.isNull(collectedFrom)) {
return false;
}
boolean isEnrich = Optional
.ofNullable(ENRICHMENT_PROVIDER.get(pType))
.map(
enrich -> enrich.containsKey(collectedFrom.getKey())
|| enrich.containsValue(collectedFrom.getValue()))
.orElse(false);
boolean isAuthority = Optional
.ofNullable(PID_AUTHORITY.get(pType))
.map(
authorities -> authorities.containsKey(collectedFrom.getKey())
|| authorities.containsValue(collectedFrom.getValue()))
.orElse(false);
return (mapHandles && pType.equals(PidType.handle)) || isEnrich || isAuthority;
}
private static boolean isNotFromDelegatedAuthority(KeyValue collectedFrom, StructuredProperty p) {
final PidType pType = PidType.tryValueOf(p.getQualifier().getClassid());
final Map<String, String> da = DELEGATED_PID_AUTHORITY.get(pType);
if (Objects.isNull(da)) {
return true;
}
if (!da.containsKey(collectedFrom.getKey())) {
return true;
}
return StringUtils.contains(p.getValue(), da.get(collectedFrom.getKey()));
}
/**
* @see {@link IdentifierFactory#createIdentifier(OafEntity, boolean)}
*/
public static <T extends OafEntity> String createIdentifier(T entity) {
return createIdentifier(entity, true);
}
private static <T extends OafEntity> String idFromPid(T entity, StructuredProperty s, boolean md5) {
return idFromPid(ModelSupport.getIdPrefix(entity.getClass()), s.getQualifier().getClassid(), s.getValue(), md5);
}
public static String idFromPid(String numericPrefix, String pidType, String pidValue, boolean md5) {
return new StringBuilder()
.append(numericPrefix)
.append(ID_PREFIX_SEPARATOR)
.append(createPrefix(pidType))
.append(ID_SEPARATOR)
.append(md5 ? md5(pidValue) : pidValue)
.toString();
}
// create the prefix (length = 12)
private static String createPrefix(String pidType) {
StringBuilder prefix = new StringBuilder(StringUtils.left(pidType, ID_PREFIX_LEN));
while (prefix.length() < ID_PREFIX_LEN) {
prefix.append("_");
}
return prefix.substring(0, ID_PREFIX_LEN);
}
public static String md5(final String s) {
try {
final MessageDigest md = MessageDigest.getInstance("MD5");
md.update(s.getBytes(StandardCharsets.UTF_8));
return new String(Hex.encodeHex(md.digest()));
} catch (final Exception e) {
return null;
}
}
}

View File

@ -0,0 +1,78 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Optional;
import java.util.stream.Collectors;
//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by FernFlower decompiler)
//
import eu.dnetlib.dhp.schema.common.EntityType;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Oaf;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.oaf.Result;
public class MergeComparator implements Comparator<Oaf> {
public MergeComparator() {
}
public int compare(Oaf left, Oaf right) {
// nulls at the end
if (left == null && right == null) {
return 0;
} else if (left == null) {
return -1;
} else if (right == null) {
return 1;
}
// invisible
if (left.getDataInfo() != null && left.getDataInfo().getInvisible() == true) {
if (right.getDataInfo() != null && right.getDataInfo().getInvisible() == false) {
return -1;
}
}
// collectedfrom
HashSet<String> lCf = getCollectedFromIds(left);
HashSet<String> rCf = getCollectedFromIds(right);
if (lCf.contains("10|openaire____::081b82f96300b6a6e3d282bad31cb6e2")
&& !rCf.contains("10|openaire____::081b82f96300b6a6e3d282bad31cb6e2")) {
return -1;
} else if (!lCf.contains("10|openaire____::081b82f96300b6a6e3d282bad31cb6e2")
&& rCf.contains("10|openaire____::081b82f96300b6a6e3d282bad31cb6e2")) {
return 1;
}
SubEntityType lClass = SubEntityType.fromClass(left.getClass());
SubEntityType rClass = SubEntityType.fromClass(right.getClass());
return lClass.ordinal() - rClass.ordinal();
}
protected HashSet<String> getCollectedFromIds(Oaf left) {
return (HashSet) Optional.ofNullable(left.getCollectedfrom()).map((cf) -> {
return (HashSet) cf.stream().map(KeyValue::getKey).collect(Collectors.toCollection(HashSet::new));
}).orElse(new HashSet());
}
enum SubEntityType {
publication, dataset, software, otherresearchproduct, datasource, organization, project;
/**
* Resolves the EntityType, given the relative class name
*
* @param clazz the given class name
* @param <T> actual OafEntity subclass
* @return the EntityType associated to the given class
*/
public static <T extends Oaf> SubEntityType fromClass(Class<T> clazz) {
return valueOf(clazz.getSimpleName().toLowerCase());
}
}
}

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,27 @@
package eu.dnetlib.dhp.schema.oaf.utils;
public class ModelHardLimits {
private ModelHardLimits() {
}
public static final String LAYOUT = "index";
public static final String INTERPRETATION = "openaire";
public static final String SEPARATOR = "-";
public static final int MAX_EXTERNAL_ENTITIES = 50;
public static final int MAX_AUTHORS = 200;
public static final int MAX_AUTHOR_FULLNAME_LENGTH = 1000;
public static final int MAX_TITLE_LENGTH = 5000;
public static final int MAX_TITLES = 10;
public static final int MAX_ABSTRACTS = 10;
public static final int MAX_ABSTRACT_LENGTH = 150000;
public static final int MAX_RELATED_ABSTRACT_LENGTH = 500;
public static final int MAX_INSTANCES = 10;
public static String getCollectionName(String format) {
return format + SEPARATOR + LAYOUT + SEPARATOR + INTERPRETATION;
}
}

View File

@ -14,7 +14,6 @@ import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils; import org.apache.commons.lang3.StringUtils;
import eu.dnetlib.dhp.schema.common.AccessRightComparator; import eu.dnetlib.dhp.schema.common.AccessRightComparator;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*; import eu.dnetlib.dhp.schema.oaf.*;
public class OafMapperUtils { public class OafMapperUtils {
@ -22,65 +21,6 @@ public class OafMapperUtils {
private OafMapperUtils() { private OafMapperUtils() {
} }
public static Oaf merge(final Oaf left, final Oaf right) {
if (ModelSupport.isSubClass(left, OafEntity.class)) {
return mergeEntities((OafEntity) left, (OafEntity) right);
} else if (ModelSupport.isSubClass(left, Relation.class)) {
((Relation) left).mergeFrom((Relation) right);
} else {
throw new IllegalArgumentException("invalid Oaf type:" + left.getClass().getCanonicalName());
}
return left;
}
public static OafEntity mergeEntities(OafEntity left, OafEntity right) {
if (ModelSupport.isSubClass(left, Result.class)) {
return mergeResults((Result) left, (Result) right);
} else if (ModelSupport.isSubClass(left, Datasource.class)) {
left.mergeFrom(right);
} else if (ModelSupport.isSubClass(left, Organization.class)) {
left.mergeFrom(right);
} else if (ModelSupport.isSubClass(left, Project.class)) {
left.mergeFrom(right);
} else {
throw new IllegalArgumentException("invalid OafEntity subtype:" + left.getClass().getCanonicalName());
}
return left;
}
public static Result mergeResults(Result left, Result right) {
final boolean leftFromDelegatedAuthority = isFromDelegatedAuthority(left);
final boolean rightFromDelegatedAuthority = isFromDelegatedAuthority(right);
if (leftFromDelegatedAuthority && !rightFromDelegatedAuthority) {
return left;
}
if (!leftFromDelegatedAuthority && rightFromDelegatedAuthority) {
return right;
}
if (new ResultTypeComparator().compare(left, right) < 0) {
left.mergeFrom(right);
return left;
} else {
right.mergeFrom(left);
return right;
}
}
private static boolean isFromDelegatedAuthority(Result r) {
return Optional
.ofNullable(r.getInstance())
.map(
instance -> instance
.stream()
.filter(i -> Objects.nonNull(i.getCollectedfrom()))
.map(i -> i.getCollectedfrom().getKey())
.anyMatch(cfId -> IdentifierFactory.delegatedAuthorityDatasourceIds().contains(cfId)))
.orElse(false);
}
public static KeyValue keyValue(final String k, final String v) { public static KeyValue keyValue(final String k, final String v) {
final KeyValue kv = new KeyValue(); final KeyValue kv = new KeyValue();
kv.setKey(k); kv.setKey(k);

View File

@ -0,0 +1,46 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Comparator;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
public class OrganizationPidComparator implements Comparator<StructuredProperty> {
@Override
public int compare(StructuredProperty left, StructuredProperty right) {
if (left == null) {
return right == null ? 0 : -1;
} else if (right == null) {
return 1;
}
PidType lClass = PidType.tryValueOf(left.getQualifier().getClassid());
PidType rClass = PidType.tryValueOf(right.getQualifier().getClassid());
if (lClass.equals(rClass))
return 0;
if (lClass.equals(PidType.openorgs))
return -1;
if (rClass.equals(PidType.openorgs))
return 1;
if (lClass.equals(PidType.GRID))
return -1;
if (rClass.equals(PidType.GRID))
return 1;
if (lClass.equals(PidType.mag_id))
return -1;
if (rClass.equals(PidType.mag_id))
return 1;
if (lClass.equals(PidType.urn))
return -1;
if (rClass.equals(PidType.urn))
return 1;
return 0;
}
}

View File

@ -0,0 +1,8 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.HashMap;
import java.util.HashSet;
public class PidBlacklist extends HashMap<String, HashSet<String>> {
}

View File

@ -0,0 +1,40 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.io.IOException;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import org.apache.commons.io.IOUtils;
import com.fasterxml.jackson.databind.ObjectMapper;
public class PidBlacklistProvider {
private static final PidBlacklist blacklist;
static {
try {
String json = IOUtils.toString(IdentifierFactory.class.getResourceAsStream("pid_blacklist.json"));
blacklist = new ObjectMapper().readValue(json, PidBlacklist.class);
} catch (IOException e) {
throw new RuntimeException(e);
}
}
public static PidBlacklist getBlacklist() {
return blacklist;
}
public static Set<String> getBlacklist(String pidType) {
return Optional
.ofNullable(getBlacklist().get(pidType))
.orElse(new HashSet<>());
}
private PidBlacklistProvider() {
}
}

View File

@ -0,0 +1,48 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Comparator;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.oaf.Organization;
import eu.dnetlib.dhp.schema.oaf.Result;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
public class PidComparator<T extends OafEntity> implements Comparator<StructuredProperty> {
private final T entity;
public PidComparator(T entity) {
this.entity = entity;
}
@Override
public int compare(StructuredProperty left, StructuredProperty right) {
if (left == null && right == null)
return 0;
if (left == null)
return 1;
if (right == null)
return -1;
if (ModelSupport.isSubClass(entity, Result.class)) {
return compareResultPids(left, right);
}
if (ModelSupport.isSubClass(entity, Organization.class)) {
return compareOrganizationtPids(left, right);
}
// Else (but unlikely), lexicographical ordering will do.
return left.getQualifier().getClassid().compareTo(right.getQualifier().getClassid());
}
private int compareResultPids(StructuredProperty left, StructuredProperty right) {
return new ResultPidComparator().compare(left, right);
}
private int compareOrganizationtPids(StructuredProperty left, StructuredProperty right) {
return new OrganizationPidComparator().compare(left, right);
}
}

View File

@ -0,0 +1,79 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import org.apache.commons.lang3.EnumUtils;
public enum PidType {
/**
* The DOI syntax shall be made up of a DOI prefix and a DOI suffix separated by a forward slash.
*
* There is no defined limit on the length of the DOI name, or of the DOI prefix or DOI suffix.
*
* The DOI name is case-insensitive and can incorporate any printable characters from the legal graphic characters
* of Unicode. Further constraints on character use (e.g. use of language-specific alphanumeric characters) can be
* defined for an application by the ISO 26324 Registration Authority.
*
*
* DOI prefix: The DOI prefix shall be composed of a directory indicator followed by a registrant code.
* These two components shall be separated by a full stop (period). The directory indicator shall be "10" and
* distinguishes the entire set of character strings (prefix and suffix) as digital object identifiers within the
* resolution system.
*
* Registrant code: The second element of the DOI prefix shall be the registrant code. The registrant code is a
* unique string assigned to a registrant.
*
* DOI suffix: The DOI suffix shall consist of a character string of any length chosen by the registrant.
* Each suffix shall be unique to the prefix element that precedes it. The unique suffix can be a sequential number,
* or it might incorporate an identifier generated from or based on another system used by the registrant
* (e.g. ISAN, ISBN, ISRC, ISSN, ISTC, ISNI; in such cases, a preferred construction for such a suffix can be
* specified, as in Example 1).
*
* Source: https://www.doi.org/doi_handbook/2_Numbering.html#2.2
*/
doi,
/**
* PubMed Unique Identifier (PMID)
*
* This field is a 1-to-8 digit accession number with no leading zeros. It is present on all records and is the
* accession number for managing and disseminating records. PMIDs are not reused after records are deleted.
*
* Beginning in February 2012 PMIDs include extensions following a decimal point to account for article versions
* (e.g., 21804956.2). All citations are considered version 1 until replaced. The extended PMID is not displayed
* on the MEDLINE format.
*
* View the citation in abstract format in PubMed to access additional versions when available (see the article in
* the Jan-Feb 2012 NLM Technical Bulletin).
*
* Source: https://www.nlm.nih.gov/bsd/mms/medlineelements.html#pmid
*/
pmid,
/**
* This field contains the unique identifier for the cited article in PubMed Central. The identifier begins with the
* prefix PMC.
*
* Source: https://www.nlm.nih.gov/bsd/mms/medlineelements.html#pmc
*/
pmc, handle, arXiv, nct, pdb, w3id,
// Organization
openorgs, ROR, GRID, PIC, ISNI, Wikidata, FundRef, corda, corda_h2020, mag_id, urn,
// Used by dedup
undefined, original;
public static boolean isValid(String type) {
return EnumUtils.isValidEnum(PidType.class, type);
}
public static PidType tryValueOf(String s) {
try {
return PidType.valueOf(s);
} catch (Exception e) {
return PidType.original;
}
}
}

View File

@ -0,0 +1,33 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Comparator;
import java.util.Optional;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
public class PidValueComparator implements Comparator<StructuredProperty> {
@Override
public int compare(StructuredProperty left, StructuredProperty right) {
if (left == null && right == null)
return 0;
if (left == null)
return 1;
if (right == null)
return -1;
StructuredProperty l = CleaningFunctions.normalizePidValue(left);
StructuredProperty r = CleaningFunctions.normalizePidValue(right);
return Optional
.ofNullable(l.getValue())
.map(
lv -> Optional
.ofNullable(r.getValue())
.map(rv -> lv.compareTo(rv))
.orElse(-1))
.orElse(1);
}
}

View File

@ -0,0 +1,46 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Comparator;
import eu.dnetlib.dhp.schema.oaf.Qualifier;
/**
* Comparator for sorting the values from the dnet:review_levels vocabulary, implements the following ordering
*
* peerReviewed (0001) > nonPeerReviewed (0002) > UNKNOWN (0000)
*/
public class RefereedComparator implements Comparator<Qualifier> {
@Override
public int compare(Qualifier left, Qualifier right) {
if (left == null || left.getClassid() == null) {
return (right == null || right.getClassid() == null) ? 0 : -1;
} else if (right == null || right.getClassid() == null) {
return 1;
}
String lClass = left.getClassid();
String rClass = right.getClassid();
if (lClass.equals(rClass))
return 0;
if ("0001".equals(lClass))
return -1;
if ("0001".equals(rClass))
return 1;
if ("0002".equals(lClass))
return -1;
if ("0002".equals(rClass))
return 1;
if ("0000".equals(lClass))
return -1;
if ("0000".equals(rClass))
return 1;
return 0;
}
}

View File

@ -0,0 +1,56 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Comparator;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
public class ResultPidComparator implements Comparator<StructuredProperty> {
@Override
public int compare(StructuredProperty left, StructuredProperty right) {
PidType lClass = PidType.tryValueOf(left.getQualifier().getClassid());
PidType rClass = PidType.tryValueOf(right.getQualifier().getClassid());
if (lClass.equals(rClass))
return 0;
if (lClass.equals(PidType.doi))
return -1;
if (rClass.equals(PidType.doi))
return 1;
if (lClass.equals(PidType.pmid))
return -1;
if (rClass.equals(PidType.pmid))
return 1;
if (lClass.equals(PidType.pmc))
return -1;
if (rClass.equals(PidType.pmc))
return 1;
if (lClass.equals(PidType.handle))
return -1;
if (rClass.equals(PidType.handle))
return 1;
if (lClass.equals(PidType.arXiv))
return -1;
if (rClass.equals(PidType.arXiv))
return 1;
if (lClass.equals(PidType.nct))
return -1;
if (rClass.equals(PidType.nct))
return 1;
if (lClass.equals(PidType.pdb))
return -1;
if (rClass.equals(PidType.pdb))
return 1;
return 0;
}
}

View File

@ -0,0 +1,78 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static eu.dnetlib.dhp.schema.common.ModelConstants.CROSSREF_ID;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Optional;
import java.util.stream.Collectors;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Result;
public class ResultTypeComparator implements Comparator<Result> {
public static final ResultTypeComparator INSTANCE = new ResultTypeComparator();
@Override
public int compare(Result left, Result right) {
if (left == null && right == null)
return 0;
if (left == null)
return 1;
if (right == null)
return -1;
HashSet<String> lCf = getCollectedFromIds(left);
HashSet<String> rCf = getCollectedFromIds(right);
if (lCf.contains(CROSSREF_ID) && !rCf.contains(CROSSREF_ID)) {
return -1;
}
if (!lCf.contains(CROSSREF_ID) && rCf.contains(CROSSREF_ID)) {
return 1;
}
String lClass = left.getResulttype().getClassid();
String rClass = right.getResulttype().getClassid();
if (!lClass.equals(rClass)) {
if (lClass.equals(ModelConstants.PUBLICATION_RESULTTYPE_CLASSID))
return -1;
if (rClass.equals(ModelConstants.PUBLICATION_RESULTTYPE_CLASSID))
return 1;
if (lClass.equals(ModelConstants.DATASET_RESULTTYPE_CLASSID))
return -1;
if (rClass.equals(ModelConstants.DATASET_RESULTTYPE_CLASSID))
return 1;
if (lClass.equals(ModelConstants.SOFTWARE_RESULTTYPE_CLASSID))
return -1;
if (rClass.equals(ModelConstants.SOFTWARE_RESULTTYPE_CLASSID))
return 1;
if (lClass.equals(ModelConstants.ORP_RESULTTYPE_CLASSID))
return -1;
if (rClass.equals(ModelConstants.ORP_RESULTTYPE_CLASSID))
return 1;
}
// Else (but unlikely), lexicographical ordering will do.
return lClass.compareTo(rClass);
}
protected HashSet<String> getCollectedFromIds(Result left) {
return Optional
.ofNullable(left.getCollectedfrom())
.map(
cf -> cf
.stream()
.map(KeyValue::getKey)
.collect(Collectors.toCollection(HashSet::new)))
.orElse(new HashSet<>());
}
}

View File

@ -0,0 +1,20 @@
[
{
"paramName": "issm",
"paramLongName": "isSparkSessionManaged",
"paramDescription": "when true will stop SparkSession after job execution",
"paramRequired": false
},
{
"paramName": "hmu",
"paramLongName": "hiveMetastoreUris",
"paramDescription": "the hive metastore uris",
"paramRequired": true
},
{
"paramName": "sql",
"paramLongName": "sql",
"paramDescription": "sql script to execute",
"paramRequired": true
}
]

View File

@ -1,5 +1,8 @@
package eu.dnetlib.dhp.application package eu.dnetlib.dhp.application
import eu.dnetlib.dhp.common.Constants
import eu.dnetlib.dhp.utils.DHPUtils.writeHdfsFile
import scala.io.Source import scala.io.Source
/** This is the main Interface SparkApplication /** This is the main Interface SparkApplication
@ -70,4 +73,13 @@ abstract class AbstractScalaApplication(
.getOrCreate() .getOrCreate()
} }
def reportTotalSize(targetPath: String, outputBasePath: String): Unit = {
val total_items = spark.read.text(targetPath).count()
writeHdfsFile(
spark.sparkContext.hadoopConfiguration,
s"$total_items",
outputBasePath + Constants.MDSTORE_SIZE_PATH
)
}
} }

View File

@ -0,0 +1,21 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Set;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
class BlackListProviderTest {
@Test
void blackListTest() {
Assertions.assertNotNull(PidBlacklistProvider.getBlacklist());
Assertions.assertNotNull(PidBlacklistProvider.getBlacklist().get("doi"));
Assertions.assertTrue(PidBlacklistProvider.getBlacklist().get("doi").size() > 0);
final Set<String> xxx = PidBlacklistProvider.getBlacklist("xxx");
Assertions.assertNotNull(xxx);
Assertions.assertEquals(0, xxx.size());
}
}

View File

@ -0,0 +1,87 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;
import java.io.IOException;
import org.apache.commons.io.IOUtils;
import org.junit.jupiter.api.Test;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.oaf.Publication;
class IdentifierFactoryTest {
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
@Test
void testCreateIdentifierForPublication() throws IOException {
verifyIdentifier(
"publication_doi1.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true);
verifyIdentifier(
"publication_doi2.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true);
verifyIdentifier(
"publication_doi3.json", "50|pmc_________::94e4cb08c93f8733b48e2445d04002ac", true);
verifyIdentifier(
"publication_doi4.json", "50|od______2852::38861c44e6052a8d49f59a4c39ba5e66", true);
verifyIdentifier(
"publication_doi5.json", "50|doi_________::3bef95c0ca26dd55451fc8839ea69d27", true);
verifyIdentifier(
"publication_pmc1.json", "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", true);
verifyIdentifier(
"publication_pmc2.json", "50|pmc_________::94e4cb08c93f8733b48e2445d04002ac", true);
verifyIdentifier(
"publication_openapc.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true);
final String defaultID = "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f";
verifyIdentifier("publication_3.json", defaultID, true);
verifyIdentifier("publication_4.json", defaultID, true);
verifyIdentifier("publication_5.json", defaultID, true);
}
@Test
void testCreateIdentifierForPublicationNoHash() throws IOException {
verifyIdentifier("publication_doi1.json", "50|doi_________::10.1016/j.cmet.2010.03.013", false);
verifyIdentifier("publication_doi2.json", "50|doi_________::10.1016/j.cmet.2010.03.013", false);
verifyIdentifier("publication_pmc1.json", "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", false);
verifyIdentifier(
"publication_urn1.json", "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", false);
final String defaultID = "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f";
verifyIdentifier("publication_3.json", defaultID, false);
verifyIdentifier("publication_4.json", defaultID, false);
verifyIdentifier("publication_5.json", defaultID, false);
}
@Test
void testCreateIdentifierForROHub() throws IOException {
verifyIdentifier(
"orp-rohub.json", "50|w3id________::afc7592914ae190a50570db90f55f9c2", true);
}
protected void verifyIdentifier(String filename, String expectedID, boolean md5) throws IOException {
final String json = IOUtils.toString(getClass().getResourceAsStream(filename));
final Publication pub = OBJECT_MAPPER.readValue(json, Publication.class);
String id = IdentifierFactory.createIdentifier(pub, md5);
System.out.println(id);
assertNotNull(id);
assertEquals(expectedID, id);
}
}

View File

@ -0,0 +1,130 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.*;
import static org.junit.jupiter.api.Assertions.assertEquals;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.HashSet;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.commons.io.IOUtils;
import org.junit.jupiter.api.Test;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.Lists;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*;
public class MergeUtilsTest {
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
@Test
void testMergePubs_new() throws IOException {
Publication pt = read("publication_test.json", Publication.class);
Publication p1 = read("publication_test.json", Publication.class);
assertEquals(1, pt.getCollectedfrom().size());
assertEquals(ModelConstants.CROSSREF_ID, pt.getCollectedfrom().get(0).getKey());
Instance i = new Instance();
i.setUrl(Lists.newArrayList("https://..."));
p1.getInstance().add(i);
Publication ptp1 = MergeUtils.mergePublication(pt, p1);
assertNotNull(ptp1.getInstance());
assertEquals(2, ptp1.getInstance().size());
}
@Test
void testMergePubs() throws IOException {
Publication p1 = read("publication_1.json", Publication.class);
Publication p2 = read("publication_2.json", Publication.class);
Dataset d1 = read("dataset_1.json", Dataset.class);
Dataset d2 = read("dataset_2.json", Dataset.class);
assertEquals(1, p1.getCollectedfrom().size());
assertEquals(ModelConstants.CROSSREF_ID, p1.getCollectedfrom().get(0).getKey());
assertEquals(1, d2.getCollectedfrom().size());
assertFalse(cfId(d2.getCollectedfrom()).contains(ModelConstants.CROSSREF_ID));
assertEquals(1, p2.getCollectedfrom().size());
assertFalse(cfId(p2.getCollectedfrom()).contains(ModelConstants.CROSSREF_ID));
assertEquals(1, d1.getCollectedfrom().size());
assertTrue(cfId(d1.getCollectedfrom()).contains(ModelConstants.CROSSREF_ID));
final Result p1d2 = MergeUtils.checkedMerge(p1, d2);
assertEquals(ModelConstants.PUBLICATION_RESULTTYPE_CLASSID, p1d2.getResulttype().getClassid());
assertTrue(p1d2 instanceof Publication);
assertEquals(p1.getId(), p1d2.getId());
}
@Test
void testMergePubs_1() throws IOException {
Publication p2 = read("publication_2.json", Publication.class);
Dataset d1 = read("dataset_1.json", Dataset.class);
final Result p2d1 = MergeUtils.checkedMerge(p2, d1);
assertEquals((ModelConstants.DATASET_RESULTTYPE_CLASSID), p2d1.getResulttype().getClassid());
assertTrue(p2d1 instanceof Dataset);
assertEquals(d1.getId(), p2d1.getId());
assertEquals(2, p2d1.getCollectedfrom().size());
}
@Test
void testMergePubs_2() throws IOException {
Publication p1 = read("publication_1.json", Publication.class);
Publication p2 = read("publication_2.json", Publication.class);
Result p1p2 = MergeUtils.checkedMerge(p1, p2);
assertTrue(p1p2 instanceof Publication);
assertEquals(p1.getId(), p1p2.getId());
assertEquals(2, p1p2.getCollectedfrom().size());
}
@Test
void testDelegatedAuthority_1() throws IOException {
Dataset d1 = read("dataset_2.json", Dataset.class);
Dataset d2 = read("dataset_delegated.json", Dataset.class);
assertEquals(1, d2.getCollectedfrom().size());
assertTrue(cfId(d2.getCollectedfrom()).contains(ModelConstants.ZENODO_OD_ID));
Result res = (Result) MergeUtils.merge(d1, d2, true);
assertEquals(d2, res);
}
@Test
void testDelegatedAuthority_2() throws IOException {
Dataset p1 = read("publication_1.json", Dataset.class);
Dataset d2 = read("dataset_delegated.json", Dataset.class);
assertEquals(1, d2.getCollectedfrom().size());
assertTrue(cfId(d2.getCollectedfrom()).contains(ModelConstants.ZENODO_OD_ID));
Result res = (Result) MergeUtils.merge(p1, d2, true);
assertEquals(d2, res);
}
protected HashSet<String> cfId(List<KeyValue> collectedfrom) {
return collectedfrom.stream().map(KeyValue::getKey).collect(Collectors.toCollection(HashSet::new));
}
protected <T extends Result> T read(String filename, Class<T> clazz) throws IOException {
final String json = IOUtils.toString(getClass().getResourceAsStream(filename));
return OBJECT_MAPPER.readValue(json, clazz);
}
}

View File

@ -149,7 +149,7 @@ class OafMapperUtilsTest {
void testDate() { void testDate() {
final String date = GraphCleaningFunctions.cleanDate("23-FEB-1998"); final String date = GraphCleaningFunctions.cleanDate("23-FEB-1998");
assertNotNull(date); assertNotNull(date);
System.out.println(date); assertEquals("1998-02-23", date);
} }
@Test @Test
@ -166,8 +166,8 @@ class OafMapperUtilsTest {
assertEquals( assertEquals(
ModelConstants.PUBLICATION_RESULTTYPE_CLASSID, ModelConstants.PUBLICATION_RESULTTYPE_CLASSID,
OafMapperUtils MergeUtils
.mergeResults(p1, d2) .mergeResult(p1, d2)
.getResulttype() .getResulttype()
.getClassid()); .getClassid());
@ -178,8 +178,8 @@ class OafMapperUtilsTest {
assertEquals( assertEquals(
ModelConstants.DATASET_RESULTTYPE_CLASSID, ModelConstants.DATASET_RESULTTYPE_CLASSID,
OafMapperUtils ((Result) MergeUtils
.mergeResults(p2, d1) .merge(p2, d1))
.getResulttype() .getResulttype()
.getClassid()); .getClassid());
} }
@ -192,7 +192,7 @@ class OafMapperUtilsTest {
assertEquals(1, d2.getCollectedfrom().size()); assertEquals(1, d2.getCollectedfrom().size());
assertTrue(cfId(d2.getCollectedfrom()).contains(ModelConstants.ZENODO_OD_ID)); assertTrue(cfId(d2.getCollectedfrom()).contains(ModelConstants.ZENODO_OD_ID));
Result res = OafMapperUtils.mergeResults(d1, d2); Result res = MergeUtils.mergeResult(d1, d2);
assertEquals(d2, res); assertEquals(d2, res);

View File

@ -0,0 +1,12 @@
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0000/ra.v2i3.114::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"4.65008652949e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0000/ra.v2i3.114"}]}]}
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0001/(aj).v3i6.458::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"4.01810569717e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0001/(aj).v3i6.458"}]}]}
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0001/1587::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.39172290649e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0001/1587"}]}]}
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0001/462::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"6.33235333753e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.36"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.00285265116e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0001/462"}]}]}
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0001/731::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"4.01810569717e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0001/731"}]}]}
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0001/ijllis.v9i4.2066.g2482::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"8.48190886761e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0001/ijllis.v9i4.2066.g2482"}]}]}
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0118/alfahim.v3i1.140::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"9.88840807598e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0118/alfahim.v3i1.140"}]}]}
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0166/fk2.stagefigshare.6442896.v3::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"7.28336930301e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0166/fk2.stagefigshare.6442896.v3"}]}]}
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0301/jttb.v2i1.64::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"7.28336930301e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0301/jttb.v2i1.64"}]}]}
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0809/seruni.v1i1.567::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"2.62959564033e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0809/seruni.v1i1.567"}]}]}
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0809/seruni.v2i1.765::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"9.40178571921e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0559872"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"3.67659957614e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0809/seruni.v2i1.765"}]}]}
{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0901/jkip.v7i3.485::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"6.26204125721e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0901/jkip.v7i3.485"}]}]}

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1 @@
{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f","pid":[{"qualifier":{"classid":"scp-number"},"value":"79953761260"}]}

View File

@ -0,0 +1 @@
{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f","pid":[]}

View File

@ -0,0 +1 @@
{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f"}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,33 @@
{
"id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
"instance": [
{
"collectedfrom": {
"key": "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2",
"value": "Crossref"
},
"pid": [
{
"qualifier": {"classid": "doi"},
"value": "10.1016/j.cmet.2010.03.013"
}
]
},
{
"pid": [
{
"qualifier": {"classid": "urn"},
"value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
},
{
"qualifier": {"classid": "scp-number"},
"value": "79953761260"
},
{
"qualifier": {"classid": "pmc"},
"value": "21459329"
}
]
}
]
}

View File

@ -0,0 +1,37 @@
{
"id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
"instance": [
{
"collectedfrom": {
"key": "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2",
"value": "Crossref"
},
"pid": [
{
"qualifier": {"classid": "doi"},
"value": "10.1016/j.cmet.2010.03.013"
}
]
},
{
"collectedfrom": {
"key": "10|opendoar____::8b6dd7db9af49e67306feb59a8bdc52c",
"value": "Europe PubMed Central"
},
"pid": [
{
"qualifier": {"classid": "urn"},
"value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
},
{
"qualifier": {"classid": "scp-number"},
"value": "79953761260"
},
{
"qualifier": {"classid": "pmc"},
"value": "21459329"
}
]
}
]
}

View File

@ -0,0 +1,37 @@
{
"id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
"instance": [
{
"collectedfrom": {
"key": "10|opendoar____::358aee4cc897452c00244351e4d91f69",
"value": "Zenodo"
},
"pid": [
{
"qualifier": {"classid": "doi"},
"value": "10.1016/j.cmet.2010.03.013"
}
]
},
{
"collectedfrom": {
"key": "10|opendoar____::8b6dd7db9af49e67306feb59a8bdc52c",
"value": "Europe PubMed Central"
},
"pid": [
{
"qualifier": {"classid": "urn"},
"value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
},
{
"qualifier": {"classid": "scp-number"},
"value": "79953761260"
},
{
"qualifier": {"classid": "pmc"},
"value": "21459329"
}
]
}
]
}

View File

@ -0,0 +1,37 @@
{
"id": "50|od______2852::38861c44e6052a8d49f59a4c39ba5e66",
"instance": [
{
"collectedfrom": {
"key": "10|opendoar____::358aee4cc897452c00244351e4d91f69",
"value": "Zenodo"
},
"pid": [
{
"qualifier": {"classid": "doi"},
"value": "10.1016/j.cmet.2010.03.013"
},
{
"qualifier": {"classid": "handle"},
"value": "11012/83840"
}
]
},
{
"collectedfrom": {
"key": "10|opendoar____::2852",
"value": "Digital library of Brno University of Technology"
},
"pid": [
{
"qualifier": {"classid": "pmc"},
"value": "21459329"
},
{
"qualifier": {"classid": "handle"},
"value": "11012/83840"
}
]
}
]
}

View File

@ -0,0 +1,37 @@
{
"id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
"instance": [
{
"collectedfrom": {
"key": "10|opendoar____::358aee4cc897452c00244351e4d91f69",
"value": "Zenodo"
},
"pid": [
{
"qualifier": {"classid": "doi"},
"value": "10.5281/zenodo.5121485"
}
]
},
{
"collectedfrom": {
"key": "10|opendoar____::8b6dd7db9af49e67306feb59a8bdc52c",
"value": "Europe PubMed Central"
},
"pid": [
{
"qualifier": {"classid": "urn"},
"value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
},
{
"qualifier": {"classid": "scp-number"},
"value": "79953761260"
},
{
"qualifier": {"classid": "pmc"},
"value": "21459329"
}
]
}
]
}

View File

@ -0,0 +1,3 @@
{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", "resulttype" : { "classid" : "publication" }, "pid":[{"qualifier":{"classid":"doi"},"value":"10.1016/j.cmet.2011.03.013"},{"qualifier":{"classid":"urn"},"value":"urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"},{"qualifier":{"classid":"scp-number"},"value":"79953761260"},{"qualifier":{"classid":"pmc"},"value":"21459329"}], "collectedfrom" : [ { "key" : "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2", "value" : "Crossref"} ], "isGreen": null, "openAccessColor": "gold", "isInDiamondJournal": null, "publiclyFunded": null}
{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g", "resulttype" : { "classid" : "publication" }, "isGreen": true, "openAccessColor": "gold", "isInDiamondJournal": true, "publiclyFunded": false }
{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1h", "resulttype" : { "classid" : "publication" }, "isGreen": false, "openAccessColor": null, "isInDiamondJournal": true, "publiclyFunded": false }

View File

@ -0,0 +1,3 @@
{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", "resulttype" : { "classid" : "publication" }, "pid":[{"qualifier":{"classid":"doi"},"value":"10.1016/j.cmet.2011.03.013"},{"qualifier":{"classid":"urn"},"value":"urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"},{"qualifier":{"classid":"scp-number"},"value":"79953761260"},{"qualifier":{"classid":"pmc"},"value":"21459329"}], "collectedfrom" : [ { "key" : "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2", "value" : "Crossref"} ], "isGreen": null, "openAccessColor": "gold", "isInDiamondJournal": null, "publiclyFunded": null}
{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g", "resulttype" : { "classid" : "publication" }, "isGreen": true, "openAccessColor": "bronze", "isInDiamondJournal": true, "publiclyFunded": false }
{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1h", "resulttype" : { "classid" : "publication" }, "isGreen": false, "openAccessColor": null, "isInDiamondJournal": true, "publiclyFunded": false }

View File

@ -0,0 +1,31 @@
{
"id": "50|openapc_____::000023f9cb6e3a247c764daec4273cbc",
"resuttype": {
"classid": "publication"
},
"instance": [
{
"collectedfrom": {
"key": "10|apc_________::e2b1600b229fc30663c8a1f662debddf",
"value": "OpenAPC Global Initiative"
},
"pid": [
{
"qualifier": {"classid": "doi"},
"value": "10.1016/j.cmet.2010.03.013"
},
{
"qualifier": {"classid": "pmc"},
"value": "21459329"
},
{
"qualifier": {"classid": "pmid"},
"value": "25811027"
}
],
"url":["https://doi.org/10.1155/2015/439379"]
}
]
}

View File

@ -0,0 +1,17 @@
{
"id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
"pid": [
{
"qualifier": {"classid": "urn"},
"value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
},
{
"qualifier": {"classid": "scp-number"},
"value": "79953761260"
},
{
"qualifier": {"classid": "pmc"},
"value": "21459329"
}
]
}

View File

@ -0,0 +1,21 @@
{
"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
"instance": [
{
"collectedfrom": {
"key": "10|opendoar____::8b6dd7db9af49e67306feb59a8bdc52c",
"value": "Europe PubMed Central"
},
"pid": [
{
"qualifier": {"classid": "doi"},
"value": "10.1016/j.cmet.2010.03.013"
},
{
"qualifier":{"classid":"pmc"},
"value":"21459329"
}
]
}
]
}

View File

@ -0,0 +1,428 @@
{
"author": [
{
"affiliation": null,
"fullname": "Deymier, Ghislaine",
"name": "Ghislaine",
"pid": [],
"rank": 1,
"surname": "Deymier"
},
{
"affiliation": null,
"fullname": "Gaschet, Frédéric",
"name": "Frédéric",
"pid": [],
"rank": 2,
"surname": "Gaschet"
},
{
"affiliation": null,
"fullname": "Pouyanne, Guillaume",
"name": "Guillaume",
"pid": [],
"rank": 3,
"surname": "Pouyanne"
}
],
"bestaccessright": {
"classid": "OPEN",
"classname": "Open Access",
"schemeid": "dnet:access_modes",
"schemename": "dnet:access_modes"
},
"collectedfrom": [
{
"key" : "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2",
"value" : "Crossref"
}
],
"context": [],
"contributor": [],
"country": [],
"coverage": [],
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"dateofacceptance": {
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"value": "2013-11-30"
},
"dateofcollection": "2024-02-28T00:22:13+0000",
"dateoftransformation": "2024-03-06T08:43:13.253Z",
"description": [
{
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"value": "For analyzing the reciprocal interaction between urban sprawl and car use, research has first focused on the link between urban density and mobility. By looking for a reduction in energy consumption, cities have favoured a compact planning development. Then reflection has broadened from the simple density to the wider, multi-dimensional concept of urban form. This controversy has led to a renewal of analysis in term of the costs of urban growth, notably by comparing the costs of \"compact\" and \"sprawled\" development. The idea is to compare the mobility costs of different urban forms. However, most often because of a lack of data, such studies are scarce. This paper suggests an innovative method to compute mobility costs at an infra-urban scale : The Spatialized Travel Account (STA). It is based on the CERTU's travel account methodology at a metropolitan scale. It puts forward an accurate estimate of the mobility costs for each transport mode (individual and public) and for each type of payer (households, firms, local authorities...). In order to test the relationships between mobility costs and urban form, we link the computed costs to morphological characteristics of infra-urban zones, taking in account sociodemographic characteristics of households."
},
{
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"value": "L'interaction réciproque entre étalement urbain et usage de l'automobile a conduit la recherche à se focaliser sur le lien entre les densités urbaines et la mobilité. En cherchant à réduire leur consommation d'énergie pour les transports, et donc leurs émissions de Gaz à Effet de Serre, les villes ont alors cherché à planifier la \" ville compacte \", privilégiant notamment la reconstruction de la ville sur elle-même et la densification. Par la suite, la réflexion s'est élargie de la simple densité à la notion de forme urbaine et à toutes ses dimensions. Cette controverse devait conduire à un renouveau des analyses en termes de coûts de la croissance urbaine : le débat reste vif, encore aujourd'hui, sur les coûts comparés de la ville étalée et de la ville compacte. Plus largement, il s'agit d'explorer les coûts des différentes formes urbaines en termes de mobilité. Malgré cela, généralement pour des raisons de disponibilité de données, les études sur le sujet restent extrêmement rares. Cet article propose un outil novateur pour mesurer les coûts de la mobilité à l'échelle intraurbaine : le Compte Déplacements Territorialisé (CDT). Il s'inspire de la méthode développée par le CERTU pour l'établissement des Comptes Déplacements Voyageurs à l'échelle métropolitaine. Le CDT propose, pour chacune des zones de l'agglomération, une estimation précise de l'ensemble des coûts liés aux déplacements de personnes, ventilés par mode de transport (individuels et collectifs) et par type de financeurs (ménages, entreprises, collectivités territoriales, etc.). Nous proposons une application de cette méthode à la controverse sur le lien entre forme urbaine et coûts de la mobilité. Les coûts sont reliés aux caractéristiques morphologiques des zones (en termes de densité et de diversité, notamment), en prenant soin de contrôler les facteurs socio-économiques qui influent traditionnellement sur les comportements de mobilité (taille du ménage, revenu, etc.)."
}
],
"eoscifguidelines": [],
"externalReference": [],
"extraInfo": [],
"format": [
{
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"value": "application/pdf"
}
],
"fulltext": [],
"id": "50|06cdd3ff4700::4826ac62a11a957fe332e2c291dcfcca",
"instance": [
{
"accessright": {
"classid": "OPEN",
"classname": "Open Access",
"schemeid": "dnet:access_modes",
"schemename": "dnet:access_modes"
},
"alternateIdentifier": [
{
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"qualifier": {
"classid": "doi",
"classname": "Digital Object Identifier",
"schemeid": "dnet:pid_types",
"schemename": "dnet:pid_types"
},
"value": "10.46298/cst.12132"
}
],
"collectedfrom": {
"key": "10|openaire____::6824b298c96ba906a3e6a70593affbf5",
"value": "Episciences"
},
"dateofacceptance": {
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"value": "2013-11-30"
},
"hostedby": {
"key": "10|openaire____::6824b298c96ba906a3e6a70593affbf5",
"value": "Episciences"
},
"instanceTypeMapping": [
{
"originalType": "http://purl.org/coar/resource_type/c_6501",
"typeCode": "http://purl.org/coar/resource_type/c_6501",
"typeLabel": "journal article",
"vocabularyName": "openaire::coar_resource_types_3_1"
},
{
"originalType": "http://purl.org/coar/resource_type/c_6501",
"typeCode": "Article",
"typeLabel": "Article",
"vocabularyName": "openaire::user_resource_types"
}
],
"instancetype": {
"classid": "0001",
"classname": "Article",
"schemeid": "dnet:publication_resource",
"schemename": "dnet:publication_resource"
},
"license": {
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"value": "CC BY NC SA"
},
"pid": [],
"refereed": {
"classid": "0002",
"classname": "nonPeerReviewed",
"schemeid": "dnet:review_levels",
"schemename": "dnet:review_levels"
},
"url": [
"https://doi.org/10.46298/cst.12132",
"https://cst.episciences.org/12132"
]
}
],
"language": {
"classid": "fra/fre",
"classname": "French",
"schemeid": "dnet:languages",
"schemename": "dnet:languages"
},
"lastupdatetimestamp": 1710636106633,
"metaResourceType": {
"classid": "Research Literature",
"classname": "Research Literature",
"schemeid": "openaire::meta_resource_types",
"schemename": "openaire::meta_resource_types"
},
"originalId": [
"oai:episciences.org:cst:12132",
"50|06cdd3ff4700::4826ac62a11a957fe332e2c291dcfcca"
],
"pid": [],
"publisher": {
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"value": "episciences.org"
},
"relevantdate": [
{
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"qualifier": {
"classid": "Accepted",
"classname": "Accepted",
"schemeid": "dnet:dataCite_date",
"schemename": "dnet:dataCite_date"
},
"value": "2024-02-11"
},
{
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"qualifier": {
"classid": "issued",
"classname": "issued",
"schemeid": "dnet:dataCite_date",
"schemename": "dnet:dataCite_date"
},
"value": "2013-11-30"
},
{
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"qualifier": {
"classid": "available",
"classname": "available",
"schemeid": "dnet:dataCite_date",
"schemename": "dnet:dataCite_date"
},
"value": "2013-11-30"
}
],
"resourcetype": {
"classid": "journal article",
"classname": "journal article",
"schemeid": "dnet:dataCite_resource",
"schemename": "dnet:dataCite_resource"
},
"resulttype": {
"classid": "publication",
"classname": "publication",
"schemeid": "dnet:result_typologies",
"schemename": "dnet:result_typologies"
},
"source": [],
"subject": [
{
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"qualifier": {
"classid": "keyword",
"classname": "keyword",
"schemeid": "dnet:subject_classification_typologies",
"schemename": "dnet:subject_classification_typologies"
},
"value": "JEL: H - Public Economics/H.H7 - State and Local Government • Intergovernmental Relations/H.H7.H72 - State and Local Budget and Expenditures"
},
{
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"qualifier": {
"classid": "keyword",
"classname": "keyword",
"schemeid": "dnet:subject_classification_typologies",
"schemename": "dnet:subject_classification_typologies"
},
"value": "Local public finance"
},
{
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"qualifier": {
"classid": "keyword",
"classname": "keyword",
"schemeid": "dnet:subject_classification_typologies",
"schemename": "dnet:subject_classification_typologies"
},
"value": "JEL: R - Urban, Rural, Regional, Real Estate, and Transportation Economics/R.R5 - Regional Government Analysis/R.R5.R51 - Finance in Urban and Rural Economies"
}
],
"title": [
{
"dataInfo": {
"deletedbyinference": false,
"inferred": false,
"invisible": false,
"provenanceaction": {
"classid": "sysimport:crosswalk:repository",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
},
"trust": "0.9"
},
"qualifier": {
"classid": "main title",
"classname": "main title",
"schemeid": "dnet:dataCite_title",
"schemename": "dnet:dataCite_title"
},
"value": "Urban form and the costs of daily mobility. The spatialized travel account tool and its application to the Bordeaux metropolitan area"
}
]
}

View File

@ -0,0 +1,23 @@
{
"id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
"pid": [
{
"qualifier": {
"classid": "urn"
},
"value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
},
{
"qualifier": {
"classid": "scp-number"
},
"value": "79953761260"
},
{
"qualifier": {
"classid": "pmcid"
},
"value": "21459329"
}
]
}

File diff suppressed because one or more lines are too long

View File

@ -14,9 +14,9 @@ import eu.dnetlib.pace.config.Config;
public abstract class AbstractClusteringFunction extends AbstractPaceFunctions implements ClusteringFunction { public abstract class AbstractClusteringFunction extends AbstractPaceFunctions implements ClusteringFunction {
protected Map<String, Integer> params; protected Map<String, Object> params;
public AbstractClusteringFunction(final Map<String, Integer> params) { public AbstractClusteringFunction(final Map<String, Object> params) {
this.params = params; this.params = params;
} }
@ -27,7 +27,7 @@ public abstract class AbstractClusteringFunction extends AbstractPaceFunctions i
return fields return fields
.stream() .stream()
.filter(f -> !f.isEmpty()) .filter(f -> !f.isEmpty())
.map(this::normalize) .map(s -> normalize(s))
.map(s -> filterAllStopWords(s)) .map(s -> filterAllStopWords(s))
.map(s -> doApply(conf, s)) .map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist)) .map(c -> filterBlacklisted(c, ngramBlacklist))
@ -36,11 +36,24 @@ public abstract class AbstractClusteringFunction extends AbstractPaceFunctions i
.collect(Collectors.toCollection(HashSet::new)); .collect(Collectors.toCollection(HashSet::new));
} }
public Map<String, Integer> getParams() { public Map<String, Object> getParams() {
return params; return params;
} }
protected Integer param(String name) { protected Integer param(String name) {
return params.get(name); Object val = params.get(name);
if (val == null)
return null;
if (val instanceof Number) {
return ((Number) val).intValue();
}
return Integer.parseInt(val.toString());
}
protected int paramOrDefault(String name, int i) {
Integer res = param(name);
if (res == null)
res = i;
return res;
} }
} }

View File

@ -13,7 +13,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("acronyms") @ClusteringClass("acronyms")
public class Acronyms extends AbstractClusteringFunction { public class Acronyms extends AbstractClusteringFunction {
public Acronyms(Map<String, Integer> params) { public Acronyms(Map<String, Object> params) {
super(params); super(params);
} }

View File

@ -11,6 +11,6 @@ public interface ClusteringFunction {
public Collection<String> apply(Config config, List<String> fields); public Collection<String> apply(Config config, List<String> fields);
public Map<String, Integer> getParams(); public Map<String, Object> getParams();
} }

View File

@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("immutablefieldvalue") @ClusteringClass("immutablefieldvalue")
public class ImmutableFieldValue extends AbstractClusteringFunction { public class ImmutableFieldValue extends AbstractClusteringFunction {
public ImmutableFieldValue(final Map<String, Integer> params) { public ImmutableFieldValue(final Map<String, Object> params) {
super(params); super(params);
} }

View File

@ -0,0 +1,69 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils;
import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.Option;
import eu.dnetlib.pace.common.AbstractPaceFunctions;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.util.MapDocumentUtil;
@ClusteringClass("jsonlistclustering")
public class JSONListClustering extends AbstractPaceFunctions implements ClusteringFunction {
private Map<String, Object> params;
public JSONListClustering(Map<String, Object> params) {
this.params = params;
}
@Override
public Map<String, Object> getParams() {
return params;
}
@Override
public Collection<String> apply(Config conf, List<String> fields) {
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(s -> doApply(conf, s))
.filter(StringUtils::isNotBlank)
.collect(Collectors.toCollection(HashSet::new));
}
private String doApply(Config conf, String json) {
StringBuilder st = new StringBuilder(); // to build the string used for comparisons basing on the jpath into
// parameters
final DocumentContext documentContext = JsonPath
.using(Configuration.defaultConfiguration().addOptions(Option.SUPPRESS_EXCEPTIONS))
.parse(json);
// for each path in the param list
for (String key : params.keySet().stream().filter(k -> k.contains("jpath")).collect(Collectors.toList())) {
String path = params.get(key).toString();
String value = MapDocumentUtil.getJPathString(path, documentContext);
if (value == null || value.isEmpty())
value = "";
st.append(value);
st.append(" ");
}
st.setLength(st.length() - 1);
if (StringUtils.isBlank(st)) {
return "1";
}
return st.toString();
}
}

View File

@ -11,7 +11,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("keywordsclustering") @ClusteringClass("keywordsclustering")
public class KeywordsClustering extends AbstractClusteringFunction { public class KeywordsClustering extends AbstractClusteringFunction {
public KeywordsClustering(Map<String, Integer> params) { public KeywordsClustering(Map<String, Object> params) {
super(params); super(params);
} }
@ -19,8 +19,8 @@ public class KeywordsClustering extends AbstractClusteringFunction {
protected Collection<String> doApply(final Config conf, String s) { protected Collection<String> doApply(final Config conf, String s) {
// takes city codes and keywords codes without duplicates // takes city codes and keywords codes without duplicates
Set<String> keywords = getKeywords(s, conf.translationMap(), params.getOrDefault("windowSize", 4)); Set<String> keywords = getKeywords(s, conf.translationMap(), paramOrDefault("windowSize", 4));
Set<String> cities = getCities(s, params.getOrDefault("windowSize", 4)); Set<String> cities = getCities(s, paramOrDefault("windowSize", 4));
// list of combination to return as result // list of combination to return as result
final Collection<String> combinations = new LinkedHashSet<String>(); final Collection<String> combinations = new LinkedHashSet<String>();
@ -28,7 +28,7 @@ public class KeywordsClustering extends AbstractClusteringFunction {
for (String keyword : keywordsToCodes(keywords, conf.translationMap())) { for (String keyword : keywordsToCodes(keywords, conf.translationMap())) {
for (String city : citiesToCodes(cities)) { for (String city : citiesToCodes(cities)) {
combinations.add(keyword + "-" + city); combinations.add(keyword + "-" + city);
if (combinations.size() >= params.getOrDefault("max", 2)) { if (combinations.size() >= paramOrDefault("max", 2)) {
return combinations; return combinations;
} }
} }
@ -42,8 +42,8 @@ public class KeywordsClustering extends AbstractClusteringFunction {
return fields return fields
.stream() .stream()
.filter(f -> !f.isEmpty()) .filter(f -> !f.isEmpty())
.map(this::cleanup) .map(KeywordsClustering::cleanup)
.map(this::normalize) .map(KeywordsClustering::normalize)
.map(s -> filterAllStopWords(s)) .map(s -> filterAllStopWords(s))
.map(s -> doApply(conf, s)) .map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist)) .map(c -> filterBlacklisted(c, ngramBlacklist))

View File

@ -16,7 +16,7 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {
private boolean DEFAULT_AGGRESSIVE = true; private boolean DEFAULT_AGGRESSIVE = true;
public LastNameFirstInitial(final Map<String, Integer> params) { public LastNameFirstInitial(final Map<String, Object> params) {
super(params); super(params);
} }
@ -25,7 +25,7 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {
return fields return fields
.stream() .stream()
.filter(f -> !f.isEmpty()) .filter(f -> !f.isEmpty())
.map(this::normalize) .map(LastNameFirstInitial::normalize)
.map(s -> doApply(conf, s)) .map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist)) .map(c -> filterBlacklisted(c, ngramBlacklist))
.flatMap(c -> c.stream()) .flatMap(c -> c.stream())
@ -33,8 +33,7 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {
.collect(Collectors.toCollection(HashSet::new)); .collect(Collectors.toCollection(HashSet::new));
} }
@Override public static String normalize(final String s) {
protected String normalize(final String s) {
return fixAliases(transliterate(nfd(unicodeNormalization(s)))) return fixAliases(transliterate(nfd(unicodeNormalization(s))))
// do not compact the regexes in a single expression, would cause StackOverflowError in case of large input // do not compact the regexes in a single expression, would cause StackOverflowError in case of large input
// strings // strings

View File

@ -15,7 +15,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("lowercase") @ClusteringClass("lowercase")
public class LowercaseClustering extends AbstractClusteringFunction { public class LowercaseClustering extends AbstractClusteringFunction {
public LowercaseClustering(final Map<String, Integer> params) { public LowercaseClustering(final Map<String, Object> params) {
super(params); super(params);
} }

View File

@ -12,11 +12,11 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("ngrampairs") @ClusteringClass("ngrampairs")
public class NgramPairs extends Ngrams { public class NgramPairs extends Ngrams {
public NgramPairs(Map<String, Integer> params) { public NgramPairs(Map<String, Object> params) {
super(params, false); super(params, false);
} }
public NgramPairs(Map<String, Integer> params, boolean sorted) { public NgramPairs(Map<String, Object> params, boolean sorted) {
super(params, sorted); super(params, sorted);
} }

View File

@ -10,11 +10,11 @@ public class Ngrams extends AbstractClusteringFunction {
private final boolean sorted; private final boolean sorted;
public Ngrams(Map<String, Integer> params) { public Ngrams(Map<String, Object> params) {
this(params, false); this(params, false);
} }
public Ngrams(Map<String, Integer> params, boolean sorted) { public Ngrams(Map<String, Object> params, boolean sorted) {
super(params); super(params);
this.sorted = sorted; this.sorted = sorted;
} }

View File

@ -0,0 +1,113 @@
package eu.dnetlib.pace.clustering;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;
import com.google.common.base.Splitter;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("numAuthorsTitleSuffixPrefixChain")
public class NumAuthorsTitleSuffixPrefixChain extends AbstractClusteringFunction {
public NumAuthorsTitleSuffixPrefixChain(Map<String, Object> params) {
super(params);
}
@Override
public Collection<String> apply(Config conf, List<String> fields) {
try {
int num_authors = Math.min(Integer.parseInt(fields.get(0)), 21); // SIZE threshold is 20, +1
if (num_authors > 0) {
return super.apply(conf, fields.subList(1, fields.size()))
.stream()
.map(s -> num_authors + "-" + s)
.collect(Collectors.toList());
}
} catch (NumberFormatException e) {
// missing or null authors array
}
return Collections.emptyList();
}
@Override
protected Collection<String> doApply(Config conf, String s) {
return suffixPrefixChain(cleanup(s), param("mod"));
}
private Collection<String> suffixPrefixChain(String s, int mod) {
// create the list of words from the string (remove short words)
List<String> wordsList = Arrays
.stream(s.split(" "))
.filter(si -> si.length() > 3)
.collect(Collectors.toList());
final int words = wordsList.size();
final int letters = s.length();
// create the prefix: number of words + number of letters/mod
String prefix = words / mod + "-";
return doSuffixPrefixChain(wordsList, prefix);
}
private Collection<String> doSuffixPrefixChain(List<String> wordsList, String prefix) {
Set<String> set = Sets.newLinkedHashSet();
switch (wordsList.size()) {
case 0:
break;
case 1:
set.add(wordsList.get(0));
break;
case 2:
set
.add(
prefix +
suffix(wordsList.get(0), 3) +
prefix(wordsList.get(1), 3));
set
.add(
prefix +
prefix(wordsList.get(0), 3) +
suffix(wordsList.get(1), 3));
break;
default:
set
.add(
prefix +
suffix(wordsList.get(0), 3) +
prefix(wordsList.get(1), 3) +
suffix(wordsList.get(2), 3));
set
.add(
prefix +
prefix(wordsList.get(0), 3) +
suffix(wordsList.get(1), 3) +
prefix(wordsList.get(2), 3));
break;
}
return set;
}
private String suffix(String s, int len) {
return s.substring(s.length() - len);
}
private String prefix(String s, int len) {
return s.substring(0, len);
}
}

View File

@ -17,11 +17,11 @@ import eu.dnetlib.pace.model.Person;
@ClusteringClass("personClustering") @ClusteringClass("personClustering")
public class PersonClustering extends AbstractPaceFunctions implements ClusteringFunction { public class PersonClustering extends AbstractPaceFunctions implements ClusteringFunction {
private Map<String, Integer> params; private Map<String, Object> params;
private static final int MAX_TOKENS = 5; private static final int MAX_TOKENS = 5;
public PersonClustering(final Map<String, Integer> params) { public PersonClustering(final Map<String, Object> params) {
this.params = params; this.params = params;
} }
@ -77,7 +77,7 @@ public class PersonClustering extends AbstractPaceFunctions implements Clusterin
// } // }
@Override @Override
public Map<String, Integer> getParams() { public Map<String, Object> getParams() {
return params; return params;
} }

View File

@ -15,7 +15,7 @@ public class PersonHash extends AbstractClusteringFunction {
private boolean DEFAULT_AGGRESSIVE = false; private boolean DEFAULT_AGGRESSIVE = false;
public PersonHash(final Map<String, Integer> params) { public PersonHash(final Map<String, Object> params) {
super(params); super(params);
} }

View File

@ -8,7 +8,7 @@ import eu.dnetlib.pace.config.Config;
public class RandomClusteringFunction extends AbstractClusteringFunction { public class RandomClusteringFunction extends AbstractClusteringFunction {
public RandomClusteringFunction(Map<String, Integer> params) { public RandomClusteringFunction(Map<String, Object> params) {
super(params); super(params);
} }

View File

@ -1,7 +1,10 @@
package eu.dnetlib.pace.clustering; package eu.dnetlib.pace.clustering;
import java.util.*; import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import com.google.common.base.Joiner; import com.google.common.base.Joiner;
import com.google.common.base.Splitter; import com.google.common.base.Splitter;
@ -12,7 +15,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("sortedngrampairs") @ClusteringClass("sortedngrampairs")
public class SortedNgramPairs extends NgramPairs { public class SortedNgramPairs extends NgramPairs {
public SortedNgramPairs(Map<String, Integer> params) { public SortedNgramPairs(Map<String, Object> params) {
super(params, false); super(params, false);
} }

View File

@ -15,7 +15,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("spacetrimmingfieldvalue") @ClusteringClass("spacetrimmingfieldvalue")
public class SpaceTrimmingFieldValue extends AbstractClusteringFunction { public class SpaceTrimmingFieldValue extends AbstractClusteringFunction {
public SpaceTrimmingFieldValue(final Map<String, Integer> params) { public SpaceTrimmingFieldValue(final Map<String, Object> params) {
super(params); super(params);
} }
@ -25,7 +25,7 @@ public class SpaceTrimmingFieldValue extends AbstractClusteringFunction {
res res
.add( .add(
StringUtils.isBlank(s) ? RandomStringUtils.random(getParams().get("randomLength")) StringUtils.isBlank(s) ? RandomStringUtils.random(param("randomLength"))
: s.toLowerCase().replaceAll("\\s+", "")); : s.toLowerCase().replaceAll("\\s+", ""));
return res; return res;

View File

@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("suffixprefix") @ClusteringClass("suffixprefix")
public class SuffixPrefix extends AbstractClusteringFunction { public class SuffixPrefix extends AbstractClusteringFunction {
public SuffixPrefix(Map<String, Integer> params) { public SuffixPrefix(Map<String, Object> params) {
super(params); super(params);
} }

View File

@ -15,12 +15,17 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("urlclustering") @ClusteringClass("urlclustering")
public class UrlClustering extends AbstractPaceFunctions implements ClusteringFunction { public class UrlClustering extends AbstractPaceFunctions implements ClusteringFunction {
protected Map<String, Integer> params; protected Map<String, Object> params;
public UrlClustering(final Map<String, Integer> params) { public UrlClustering(final Map<String, Object> params) {
this.params = params; this.params = params;
} }
@Override
public Map<String, Object> getParams() {
return params;
}
@Override @Override
public Collection<String> apply(final Config conf, List<String> fields) { public Collection<String> apply(final Config conf, List<String> fields) {
try { try {
@ -35,11 +40,6 @@ public class UrlClustering extends AbstractPaceFunctions implements ClusteringFu
} }
} }
@Override
public Map<String, Integer> getParams() {
return null;
}
private URL asUrl(String value) { private URL asUrl(String value) {
try { try {
return new URL(value); return new URL(value);

View File

@ -11,7 +11,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("wordsStatsSuffixPrefixChain") @ClusteringClass("wordsStatsSuffixPrefixChain")
public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction { public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {
public WordsStatsSuffixPrefixChain(Map<String, Integer> params) { public WordsStatsSuffixPrefixChain(Map<String, Object> params) {
super(params); super(params);
} }
@ -20,7 +20,7 @@ public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {
return suffixPrefixChain(s, param("mod")); return suffixPrefixChain(s, param("mod"));
} }
private Collection<String> suffixPrefixChain(String s, int mod) { static Collection<String> suffixPrefixChain(String s, int mod) {
// create the list of words from the string (remove short words) // create the list of words from the string (remove short words)
List<String> wordsList = Arrays List<String> wordsList = Arrays
@ -38,7 +38,7 @@ public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {
} }
private Collection<String> doSuffixPrefixChain(List<String> wordsList, String prefix) { static private Collection<String> doSuffixPrefixChain(List<String> wordsList, String prefix) {
Set<String> set = Sets.newLinkedHashSet(); Set<String> set = Sets.newLinkedHashSet();
switch (wordsList.size()) { switch (wordsList.size()) {
@ -80,12 +80,16 @@ public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {
} }
private String suffix(String s, int len) { private static String suffix(String s, int len) {
return s.substring(s.length() - len); return s.substring(s.length() - len);
} }
private String prefix(String s, int len) { private static String prefix(String s, int len) {
return s.substring(0, len); return s.substring(0, len);
} }
static public void main(String[] args) {
String title = "MY LIFE AS A BOSON: THE STORY OF \"THE HIGGS\"".toLowerCase();
System.out.println(suffixPrefixChain(title, 10));
}
} }

View File

@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("wordssuffixprefix") @ClusteringClass("wordssuffixprefix")
public class WordsSuffixPrefix extends AbstractClusteringFunction { public class WordsSuffixPrefix extends AbstractClusteringFunction {
public WordsSuffixPrefix(Map<String, Integer> params) { public WordsSuffixPrefix(Map<String, Object> params) {
super(params); super(params);
} }

View File

@ -16,7 +16,6 @@ import org.apache.commons.lang3.StringUtils;
import com.google.common.base.Joiner; import com.google.common.base.Joiner;
import com.google.common.base.Splitter; import com.google.common.base.Splitter;
import com.google.common.collect.Iterables; import com.google.common.collect.Iterables;
import com.google.common.collect.Lists;
import com.google.common.collect.Sets; import com.google.common.collect.Sets;
import com.ibm.icu.text.Transliterator; import com.ibm.icu.text.Transliterator;
@ -27,7 +26,7 @@ import eu.dnetlib.pace.clustering.NGramUtils;
* *
* @author claudio * @author claudio
*/ */
public abstract class AbstractPaceFunctions { public class AbstractPaceFunctions {
// city map to be used when translating the city names into codes // city map to be used when translating the city names into codes
private static Map<String, String> cityMap = AbstractPaceFunctions private static Map<String, String> cityMap = AbstractPaceFunctions
@ -62,11 +61,14 @@ public abstract class AbstractPaceFunctions {
private static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})"); private static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
protected String concat(final List<String> l) { private static Pattern romanNumberPattern = Pattern
.compile("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$");
protected static String concat(final List<String> l) {
return Joiner.on(" ").skipNulls().join(l); return Joiner.on(" ").skipNulls().join(l);
} }
protected String cleanup(final String s) { public static String cleanup(final String s) {
final String s1 = HTML_REGEX.matcher(s).replaceAll(""); final String s1 = HTML_REGEX.matcher(s).replaceAll("");
final String s2 = unicodeNormalization(s1.toLowerCase()); final String s2 = unicodeNormalization(s1.toLowerCase());
final String s3 = nfd(s2); final String s3 = nfd(s2);
@ -82,7 +84,7 @@ public abstract class AbstractPaceFunctions {
return s12; return s12;
} }
protected String fixXML(final String a) { protected static String fixXML(final String a) {
return a return a
.replaceAll("&ndash;", " ") .replaceAll("&ndash;", " ")
@ -91,7 +93,7 @@ public abstract class AbstractPaceFunctions {
.replaceAll("&minus;", " "); .replaceAll("&minus;", " ");
} }
protected boolean checkNumbers(final String a, final String b) { protected static boolean checkNumbers(final String a, final String b) {
final String numbersA = getNumbers(a); final String numbersA = getNumbers(a);
final String numbersB = getNumbers(b); final String numbersB = getNumbers(b);
final String romansA = getRomans(a); final String romansA = getRomans(a);
@ -99,7 +101,7 @@ public abstract class AbstractPaceFunctions {
return !numbersA.equals(numbersB) || !romansA.equals(romansB); return !numbersA.equals(numbersB) || !romansA.equals(romansB);
} }
protected String getRomans(final String s) { protected static String getRomans(final String s) {
final StringBuilder sb = new StringBuilder(); final StringBuilder sb = new StringBuilder();
for (final String t : s.split(" ")) { for (final String t : s.split(" ")) {
sb.append(isRoman(t) ? t : ""); sb.append(isRoman(t) ? t : "");
@ -107,13 +109,12 @@ public abstract class AbstractPaceFunctions {
return sb.toString(); return sb.toString();
} }
protected boolean isRoman(final String s) { protected static boolean isRoman(final String s) {
return s Matcher m = romanNumberPattern.matcher(s);
.replaceAll("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", "qwertyuiop") return m.matches() && m.hitEnd();
.equals("qwertyuiop");
} }
protected String getNumbers(final String s) { protected static String getNumbers(final String s) {
final StringBuilder sb = new StringBuilder(); final StringBuilder sb = new StringBuilder();
for (final String t : s.split(" ")) { for (final String t : s.split(" ")) {
sb.append(isNumber(t) ? t : ""); sb.append(isNumber(t) ? t : "");
@ -121,7 +122,7 @@ public abstract class AbstractPaceFunctions {
return sb.toString(); return sb.toString();
} }
public boolean isNumber(String strNum) { public static boolean isNumber(String strNum) {
if (strNum == null) { if (strNum == null) {
return false; return false;
} }
@ -147,7 +148,7 @@ public abstract class AbstractPaceFunctions {
} }
} }
protected String removeSymbols(final String s) { protected static String removeSymbols(final String s) {
final StringBuilder sb = new StringBuilder(); final StringBuilder sb = new StringBuilder();
s.chars().forEach(ch -> { s.chars().forEach(ch -> {
@ -157,11 +158,11 @@ public abstract class AbstractPaceFunctions {
return sb.toString().replaceAll("\\s+", " "); return sb.toString().replaceAll("\\s+", " ");
} }
protected boolean notNull(final String s) { protected static boolean notNull(final String s) {
return s != null; return s != null;
} }
protected String normalize(final String s) { public static String normalize(final String s) {
return fixAliases(transliterate(nfd(unicodeNormalization(s)))) return fixAliases(transliterate(nfd(unicodeNormalization(s))))
.toLowerCase() .toLowerCase()
// do not compact the regexes in a single expression, would cause StackOverflowError in case of large input // do not compact the regexes in a single expression, would cause StackOverflowError in case of large input
@ -174,16 +175,16 @@ public abstract class AbstractPaceFunctions {
.trim(); .trim();
} }
public String nfd(final String s) { public static String nfd(final String s) {
return Normalizer.normalize(s, Normalizer.Form.NFD); return Normalizer.normalize(s, Normalizer.Form.NFD);
} }
public String utf8(final String s) { public static String utf8(final String s) {
byte[] bytes = s.getBytes(StandardCharsets.UTF_8); byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
return new String(bytes, StandardCharsets.UTF_8); return new String(bytes, StandardCharsets.UTF_8);
} }
public String unicodeNormalization(final String s) { public static String unicodeNormalization(final String s) {
Matcher m = hexUnicodePattern.matcher(s); Matcher m = hexUnicodePattern.matcher(s);
StringBuffer buf = new StringBuffer(s.length()); StringBuffer buf = new StringBuffer(s.length());
@ -195,7 +196,7 @@ public abstract class AbstractPaceFunctions {
return buf.toString(); return buf.toString();
} }
protected String filterStopWords(final String s, final Set<String> stopwords) { protected static String filterStopWords(final String s, final Set<String> stopwords) {
final StringTokenizer st = new StringTokenizer(s); final StringTokenizer st = new StringTokenizer(s);
final StringBuilder sb = new StringBuilder(); final StringBuilder sb = new StringBuilder();
while (st.hasMoreTokens()) { while (st.hasMoreTokens()) {
@ -208,7 +209,7 @@ public abstract class AbstractPaceFunctions {
return sb.toString().trim(); return sb.toString().trim();
} }
public String filterAllStopWords(String s) { public static String filterAllStopWords(String s) {
s = filterStopWords(s, stopwords_en); s = filterStopWords(s, stopwords_en);
s = filterStopWords(s, stopwords_de); s = filterStopWords(s, stopwords_de);
@ -221,7 +222,8 @@ public abstract class AbstractPaceFunctions {
return s; return s;
} }
protected Collection<String> filterBlacklisted(final Collection<String> set, final Set<String> ngramBlacklist) { protected static Collection<String> filterBlacklisted(final Collection<String> set,
final Set<String> ngramBlacklist) {
final Set<String> newset = Sets.newLinkedHashSet(); final Set<String> newset = Sets.newLinkedHashSet();
for (final String s : set) { for (final String s : set) {
if (!ngramBlacklist.contains(s)) { if (!ngramBlacklist.contains(s)) {
@ -268,7 +270,7 @@ public abstract class AbstractPaceFunctions {
return m; return m;
} }
public String removeKeywords(String s, Set<String> keywords) { public static String removeKeywords(String s, Set<String> keywords) {
s = " " + s + " "; s = " " + s + " ";
for (String k : keywords) { for (String k : keywords) {
@ -278,39 +280,39 @@ public abstract class AbstractPaceFunctions {
return s.trim(); return s.trim();
} }
public double commonElementsPercentage(Set<String> s1, Set<String> s2) { public static double commonElementsPercentage(Set<String> s1, Set<String> s2) {
double longer = Math.max(s1.size(), s2.size()); double longer = Math.max(s1.size(), s2.size());
return (double) s1.stream().filter(s2::contains).count() / longer; return (double) s1.stream().filter(s2::contains).count() / longer;
} }
// convert the set of keywords to codes // convert the set of keywords to codes
public Set<String> toCodes(Set<String> keywords, Map<String, String> translationMap) { public static Set<String> toCodes(Set<String> keywords, Map<String, String> translationMap) {
return keywords.stream().map(s -> translationMap.get(s)).collect(Collectors.toSet()); return keywords.stream().map(s -> translationMap.get(s)).collect(Collectors.toSet());
} }
public Set<String> keywordsToCodes(Set<String> keywords, Map<String, String> translationMap) { public static Set<String> keywordsToCodes(Set<String> keywords, Map<String, String> translationMap) {
return toCodes(keywords, translationMap); return toCodes(keywords, translationMap);
} }
public Set<String> citiesToCodes(Set<String> keywords) { public static Set<String> citiesToCodes(Set<String> keywords) {
return toCodes(keywords, cityMap); return toCodes(keywords, cityMap);
} }
protected String firstLC(final String s) { protected static String firstLC(final String s) {
return StringUtils.substring(s, 0, 1).toLowerCase(); return StringUtils.substring(s, 0, 1).toLowerCase();
} }
protected Iterable<String> tokens(final String s, final int maxTokens) { protected static Iterable<String> tokens(final String s, final int maxTokens) {
return Iterables.limit(Splitter.on(" ").omitEmptyStrings().trimResults().split(s), maxTokens); return Iterables.limit(Splitter.on(" ").omitEmptyStrings().trimResults().split(s), maxTokens);
} }
public String normalizePid(String pid) { public static String normalizePid(String pid) {
return DOI_PREFIX.matcher(pid.toLowerCase()).replaceAll(""); return DOI_PREFIX.matcher(pid.toLowerCase()).replaceAll("");
} }
// get the list of keywords into the input string // get the list of keywords into the input string
public Set<String> getKeywords(String s1, Map<String, String> translationMap, int windowSize) { public static Set<String> getKeywords(String s1, Map<String, String> translationMap, int windowSize) {
String s = s1; String s = s1;
@ -340,7 +342,7 @@ public abstract class AbstractPaceFunctions {
return codes; return codes;
} }
public Set<String> getCities(String s1, int windowSize) { public static Set<String> getCities(String s1, int windowSize) {
return getKeywords(s1, cityMap, windowSize); return getKeywords(s1, cityMap, windowSize);
} }

View File

@ -18,7 +18,7 @@ public class ClusteringDef implements Serializable {
private List<String> fields; private List<String> fields;
private Map<String, Integer> params; private Map<String, Object> params;
public ClusteringDef() { public ClusteringDef() {
} }
@ -43,11 +43,11 @@ public class ClusteringDef implements Serializable {
this.fields = fields; this.fields = fields;
} }
public Map<String, Integer> getParams() { public Map<String, Object> getParams() {
return params; return params;
} }
public void setParams(final Map<String, Integer> params) { public void setParams(final Map<String, Object> params) {
this.params = params; this.params = params;
} }

View File

@ -2,6 +2,7 @@
package eu.dnetlib.pace.model; package eu.dnetlib.pace.model;
import java.io.Serializable; import java.io.Serializable;
import java.util.HashSet;
import java.util.List; import java.util.List;
import com.fasterxml.jackson.core.JsonProcessingException; import com.fasterxml.jackson.core.JsonProcessingException;
@ -36,6 +37,16 @@ public class FieldDef implements Serializable {
*/ */
private int length = -1; private int length = -1;
private HashSet<String> filter;
private boolean sorted;
public boolean isSorted() {
return sorted;
}
private String clean;
public FieldDef() { public FieldDef() {
} }
@ -91,6 +102,30 @@ public class FieldDef implements Serializable {
this.path = path; this.path = path;
} }
public HashSet<String> getFilter() {
return filter;
}
public void setFilter(HashSet<String> filter) {
this.filter = filter;
}
public boolean getSorted() {
return sorted;
}
public void setSorted(boolean sorted) {
this.sorted = sorted;
}
public String getClean() {
return clean;
}
public void setClean(String clean) {
this.clean = clean;
}
@Override @Override
public String toString() { public String toString() {
try { try {

View File

@ -5,9 +5,9 @@ import eu.dnetlib.pace.util.{BlockProcessor, SparkReporter}
import org.apache.spark.SparkContext import org.apache.spark.SparkContext
import org.apache.spark.sql.catalyst.expressions.Literal import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.expressions._ import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.{col, lit, udf} import org.apache.spark.sql.functions.{col, desc, expr, lit, udf}
import org.apache.spark.sql.types._ import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, Dataset, Row, functions} import org.apache.spark.sql.{Column, Dataset, Row, SaveMode, functions}
import java.util.function.Predicate import java.util.function.Predicate
import java.util.stream.Collectors import java.util.stream.Collectors
@ -80,6 +80,8 @@ case class SparkDeduper(conf: DedupConfig) extends Serializable {
.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(inputColumns: _*)))) .withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(inputColumns: _*))))
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value // Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName)))) .withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName))))
// .withColumn("count", functions.max("position").over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName)).rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing) ))
// .filter("count > 1")
if (df_with_clustering_keys == null) if (df_with_clustering_keys == null)
df_with_clustering_keys = ds df_with_clustering_keys = ds
@ -88,20 +90,44 @@ case class SparkDeduper(conf: DedupConfig) extends Serializable {
} }
//TODO: analytics //TODO: analytics
/*df_with_clustering_keys.groupBy(col("clustering"), col("key"))
.agg(expr("max(count) AS size"))
.orderBy(desc("size"))
.show*/
val df_with_blocks = df_with_clustering_keys val df_with_blocks = df_with_clustering_keys
// filter out rows with position exceeding the maxqueuesize parameter // split the clustering block into smaller blocks of queuemaxsize
.filter(col("position").leq(conf.getWf.getQueueMaxSize)) .groupBy(col("clustering"), col("key"), functions.floor(col("position").divide(lit(conf.getWf.getQueueMaxSize))))
.groupBy("clustering", "key")
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block")) .agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
.filter(functions.size(new Column("block")).gt(1)) .filter(functions.size(new Column("block")).gt(1))
.union(
//adjacency blocks
df_with_clustering_keys
// filter out leading and trailing elements
.filter(col("position").gt(conf.getWf.getSlidingWindowSize/2))
//.filter(col("position").lt(col("count").minus(conf.getWf.getSlidingWindowSize/2)))
// create small blocks of records on "the border" of maxqueuesize: getSlidingWindowSize/2 elements before and after
.filter(
col("position").mod(conf.getWf.getQueueMaxSize).lt(conf.getWf.getSlidingWindowSize/2) // slice of the start of block
|| col("position").mod(conf.getWf.getQueueMaxSize).gt(conf.getWf.getQueueMaxSize - (conf.getWf.getSlidingWindowSize/2)) //slice of the end of the block
)
.groupBy(col("clustering"), col("key"), functions.floor((col("position") + lit(conf.getWf.getSlidingWindowSize/2)).divide(lit(conf.getWf.getQueueMaxSize))))
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
.filter(functions.size(new Column("block")).gt(1))
)
df_with_blocks df_with_blocks
} }
def clusterValuesUDF(cd: ClusteringDef) = { def clusterValuesUDF(cd: ClusteringDef) = {
udf[mutable.WrappedArray[String], mutable.WrappedArray[Any]](values => { udf[mutable.WrappedArray[String], mutable.WrappedArray[Any]](values => {
values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala) val valueList = values.flatMap {
case a: mutable.WrappedArray[Any] => a.map(_.toString)
case s: Any => Seq(s.toString)
}.asJava;
mutable.WrappedArray.make(cd.clusteringFunction().apply(conf, valueList).toArray())
}) })
} }

View File

@ -1,12 +1,16 @@
package eu.dnetlib.pace.model package eu.dnetlib.pace.model
import com.jayway.jsonpath.{Configuration, JsonPath} import com.jayway.jsonpath.{Configuration, JsonPath}
import eu.dnetlib.pace.common.AbstractPaceFunctions
import eu.dnetlib.pace.config.{DedupConfig, Type} import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.util.{MapDocumentUtil, SparkCompatUtils} import eu.dnetlib.pace.util.MapDocumentUtil
import org.apache.spark.sql.{Dataset, Row} import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType} import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
import org.apache.spark.sql.{Dataset, Row}
import java.util.Locale
import java.util.regex.Pattern import java.util.regex.Pattern
import scala.collection.JavaConverters._ import scala.collection.JavaConverters._
@ -48,7 +52,7 @@ case class SparkModel(conf: DedupConfig) {
val orderingFieldPosition: Int = schema.fieldIndex(orderingFieldName) val orderingFieldPosition: Int = schema.fieldIndex(orderingFieldName)
val parseJsonDataset: (Dataset[String] => Dataset[Row]) = df => { val parseJsonDataset: (Dataset[String] => Dataset[Row]) = df => {
df.map(r => rowFromJson(r))(SparkCompatUtils.encoderFor(schema)) df.map(r => rowFromJson(r))(RowEncoder(schema))
} }
def rowFromJson(json: String): Row = { def rowFromJson(json: String): Row = {
@ -59,7 +63,7 @@ case class SparkModel(conf: DedupConfig) {
values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext) values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
schema.fieldNames.zipWithIndex.foldLeft(values) { schema.fieldNames.zipWithIndex.foldLeft(values) {
case ((res, (fname, index))) => { case ((res, (fname, index))) =>
val fdef = conf.getPace.getModelMap.get(fname) val fdef = conf.getPace.getModelMap.get(fname)
if (fdef != null) { if (fdef != null) {
@ -95,13 +99,52 @@ case class SparkModel(conf: DedupConfig) {
case Type.DoubleArray => case Type.DoubleArray =>
MapDocumentUtil.getJPathArray(fdef.getPath, json) MapDocumentUtil.getJPathArray(fdef.getPath, json)
} }
val filter = fdef.getFilter
if (StringUtils.isNotBlank(fdef.getClean)) {
res(index) = res(index) match {
case x: Seq[String] => x.map(clean(_, fdef.getClean)).toSeq
case _ => clean(res(index).toString, fdef.getClean)
}
}
if (filter != null && !filter.isEmpty) {
res(index) = res(index) match {
case x: String if filter.contains(x.toLowerCase(Locale.ROOT)) => null
case x: Seq[String] => x.filter(s => !filter.contains(s.toLowerCase(Locale.ROOT))).toSeq
case _ => res(index)
}
}
if (fdef.getSorted) {
res(index) = res(index) match {
case x: Seq[String] => x.sorted.toSeq
case _ => res(index)
}
}
} }
res res
} }
}
new GenericRowWithSchema(values, schema) new GenericRowWithSchema(values, schema)
} }
def clean(value: String, cleantype: String) : String = {
val res = cleantype match {
case "title" => AbstractPaceFunctions.cleanup(value)
case _ => value
}
// if (!res.equals(AbstractPaceFunctions.normalize(value))) {
// println(res)
// println(AbstractPaceFunctions.normalize(value))
// println()
// }
res
}
} }

View File

@ -1,8 +1,10 @@
package eu.dnetlib.pace.tree; package eu.dnetlib.pace.tree;
import java.util.ArrayList;
import java.util.List; import java.util.List;
import java.util.Map; import java.util.Map;
import java.util.function.BiFunction;
import java.util.stream.Collectors; import java.util.stream.Collectors;
import com.wcohen.ss.AbstractStringDistance; import com.wcohen.ss.AbstractStringDistance;
@ -11,6 +13,7 @@ import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.model.Person; import eu.dnetlib.pace.model.Person;
import eu.dnetlib.pace.tree.support.AbstractListComparator; import eu.dnetlib.pace.tree.support.AbstractListComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass; import eu.dnetlib.pace.tree.support.ComparatorClass;
import eu.dnetlib.pace.util.AuthorMatchers;
@ComparatorClass("authorsMatch") @ComparatorClass("authorsMatch")
public class AuthorsMatch extends AbstractListComparator { public class AuthorsMatch extends AbstractListComparator {
@ -23,7 +26,6 @@ public class AuthorsMatch extends AbstractListComparator {
private String MODE; // full or surname private String MODE; // full or surname
private int SIZE_THRESHOLD; private int SIZE_THRESHOLD;
private String TYPE; // count or percentage private String TYPE; // count or percentage
private int common;
public AuthorsMatch(Map<String, String> params) { public AuthorsMatch(Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler()); super(params, new com.wcohen.ss.JaroWinkler());
@ -35,7 +37,6 @@ public class AuthorsMatch extends AbstractListComparator {
FULLNAME_THRESHOLD = Double.parseDouble(params.getOrDefault("fullname_th", "0.9")); FULLNAME_THRESHOLD = Double.parseDouble(params.getOrDefault("fullname_th", "0.9"));
SIZE_THRESHOLD = Integer.parseInt(params.getOrDefault("size_th", "20")); SIZE_THRESHOLD = Integer.parseInt(params.getOrDefault("size_th", "20"));
TYPE = params.getOrDefault("type", "percentage"); TYPE = params.getOrDefault("type", "percentage");
common = 0;
} }
protected AuthorsMatch(double w, AbstractStringDistance ssalgo) { protected AuthorsMatch(double w, AbstractStringDistance ssalgo) {
@ -43,23 +44,40 @@ public class AuthorsMatch extends AbstractListComparator {
} }
@Override @Override
public double compare(final List<String> a, final List<String> b, final Config conf) { public double compare(final List<String> left, final List<String> right, final Config conf) {
if (left.isEmpty() || right.isEmpty())
if (a.isEmpty() || b.isEmpty())
return -1; return -1;
if (a.size() > SIZE_THRESHOLD || b.size() > SIZE_THRESHOLD) if (left.size() > SIZE_THRESHOLD || right.size() > SIZE_THRESHOLD)
return 1.0; return 1.0;
List<Person> aList = a.stream().map(author -> new Person(author, false)).collect(Collectors.toList()); Double threshold = getDoubleParam("threshold");
int maxMiss = Integer.MAX_VALUE;
if (threshold != null && threshold >= 0.0 && threshold <= 1.0 && left.size() == right.size()) {
maxMiss = (int) Math.floor((1 - threshold) * Math.max(left.size(), right.size()));
}
int common = 0;
List<String> a = new ArrayList<>(left);
List<String> b = new ArrayList<>(right);
common += AuthorMatchers
.removeMatches(a, b, (BiFunction<String, String, Object>) AuthorMatchers::matchEqualsIgnoreCase)
.size() / 2;
common += AuthorMatchers
.removeMatches(a, b, (BiFunction<String, String, Object>) AuthorMatchers::matchOrderedTokenAndAbbreviations)
.size() / 2;
List<Person> bList = b.stream().map(author -> new Person(author, false)).collect(Collectors.toList()); List<Person> bList = b.stream().map(author -> new Person(author, false)).collect(Collectors.toList());
common = 0;
// compare each element of List1 with each element of List2 // compare each element of List1 with each element of List2
for (Person p1 : aList) int alreadyMatched = common;
for (int i = 0; i < a.size(); i++) {
Person p1 = new Person(a.get(i), false);
for (Person p2 : bList) { for (Person p2 : bList) {
// both persons are inaccurate // both persons are inaccurate
if (!p1.isAccurate() && !p2.isAccurate()) { if (!p1.isAccurate() && !p2.isAccurate()) {
// compare just normalized fullnames // compare just normalized fullnames
@ -118,11 +136,15 @@ public class AuthorsMatch extends AbstractListComparator {
} }
} }
}
if (i - common - alreadyMatched > maxMiss) {
return 0.0;
}
} }
// normalization factor to compute the score // normalization factor to compute the score
int normFactor = aList.size() == bList.size() ? aList.size() : (aList.size() + bList.size() - common); int normFactor = left.size() == right.size() ? left.size() : (left.size() + right.size() - common);
if (TYPE.equals("percentage")) { if (TYPE.equals("percentage")) {
return (double) common / normFactor; return (double) common / normFactor;
@ -153,5 +175,4 @@ public class AuthorsMatch extends AbstractListComparator {
public String normalization(String s) { public String normalization(String s) {
return normalize(utf8(cleanup(s))); return normalize(utf8(cleanup(s)));
} }
} }

View File

@ -23,14 +23,18 @@ public class InstanceTypeMatch extends AbstractListComparator {
// jolly types // jolly types
translationMap.put("Conference object", "*"); translationMap.put("Conference object", "*");
translationMap.put("Research", "*");
translationMap.put("Other literature type", "*"); translationMap.put("Other literature type", "*");
translationMap.put("Unknown", "*"); translationMap.put("Unknown", "*");
translationMap.put("UNKNOWN", "*");
// article types // article types
translationMap.put("Article", "Article"); translationMap.put("Article", "Article");
translationMap.put("Journal", "Article");
translationMap.put("Data Paper", "Article"); translationMap.put("Data Paper", "Article");
translationMap.put("Software Paper", "Article"); translationMap.put("Software Paper", "Article");
translationMap.put("Preprint", "Article"); translationMap.put("Preprint", "Article");
translationMap.put("Part of book or chapter of book", "Article");
// thesis types // thesis types
translationMap.put("Thesis", "Thesis"); translationMap.put("Thesis", "Thesis");
@ -76,5 +80,4 @@ public class InstanceTypeMatch extends AbstractListComparator {
protected double normalize(final double d) { protected double normalize(final double d) {
return d; return d;
} }
} }

View File

@ -3,6 +3,7 @@ package eu.dnetlib.pace.tree;
import java.util.Map; import java.util.Map;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.logging.Log; import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory; import org.apache.commons.logging.LogFactory;
@ -30,16 +31,25 @@ public class LevensteinTitle extends AbstractStringComparator {
} }
@Override @Override
public double distance(final String a, final String b, final Config conf) { public double distance(final String ca, final String cb, final Config conf) {
final String ca = cleanup(a);
final String cb = cleanup(b);
final boolean check = checkNumbers(ca, cb); final boolean check = checkNumbers(ca, cb);
if (check) if (check)
return 0.5; return 0.5;
return normalize(ssalgo.score(ca, cb), ca.length(), cb.length()); Double threshold = getDoubleParam("threshold");
// reduce Levenshtein algo complexity when target threshold is known
if (threshold != null && threshold >= 0.0 && threshold <= 1.0) {
int maxdistance = (int) Math.floor((1 - threshold) * Math.max(ca.length(), cb.length()));
int score = StringUtils.getLevenshteinDistance(ca, cb, maxdistance);
if (score == -1) {
return 0;
}
return normalize(score, ca.length(), cb.length());
} else {
return normalize(StringUtils.getLevenshteinDistance(ca, cb), ca.length(), cb.length());
}
} }
private double normalize(final double score, final int la, final int lb) { private double normalize(final double score, final int la, final int lb) {

View File

@ -0,0 +1,29 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("maxLengthMatch")
public class MaxLengthMatch extends AbstractStringComparator {
private final int limit;
public MaxLengthMatch(Map<String, String> params) {
super(params);
limit = Integer.parseInt(params.getOrDefault("limit", "200"));
}
@Override
public double compare(String a, String b, final Config conf) {
return a.length() < limit && b.length() < limit ? 1.0 : -1.0;
}
protected String toString(final Object object) {
return toFirstString(object);
}
}

View File

@ -127,4 +127,14 @@ public abstract class AbstractComparator<T> extends AbstractPaceFunctions implem
return this.weight; return this.weight;
} }
public Double getDoubleParam(String name) {
String svalue = params.get(name);
try {
return Double.parseDouble(svalue);
} catch (Throwable t) {
}
return null;
}
} }

View File

@ -0,0 +1,112 @@
package eu.dnetlib.pace.util
import java.util.Locale
import java.util.regex.Pattern
import scala.util.control.Breaks.{break, breakable}
object AuthorMatchers {
val SPLIT_REGEX = Pattern.compile("[\\s,\\.]+")
val WORD_DIFF = 2
def matchEqualsIgnoreCase(a1: String, a2: String): Boolean = {
if (a1 == null || a2 == null)
false
else
a1 == a2 || a1.toLowerCase(Locale.ROOT).equals(a2.toLowerCase(Locale.ROOT))
}
def matchOtherNames(fullName: String, otherNames: Seq[String]): Boolean = {
if (otherNames != null) {
otherNames.exists(matchEqualsIgnoreCase(fullName, _))
} else {
false
}
}
def matchOrderedTokenAndAbbreviations(a1: String, a2: String): Boolean = {
val p1: Array[String] = SPLIT_REGEX.split(a1.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).sorted
val p2: Array[String] = SPLIT_REGEX.split(a2.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).sorted
if (p1.length < 2 || p2.length < 2) return false
if (Math.abs(p1.length - p2.length) > WORD_DIFF) return false // use alternative comparison algo
var p1Idx: Int = 0
var p2Idx: Int = 0
var shortMatches: Int = 0
var longMatches: Int = 0
while (p1Idx < p1.length && p2Idx < p2.length) {
val e1: String = p1(p1Idx)
val c1: Char = e1.charAt(0)
val e2: String = p2(p2Idx)
val c2: Char = e2.charAt(0)
if (c1 < c2) p1Idx += 1
else if (c1 > c2) p2Idx += 1
else {
var res: Boolean = false
if (e1.length != 1 && e2.length != 1) {
res = e1 == e2
if (res)
longMatches += 1
} else {
res = true
shortMatches += 1
}
if (res) {
p1Idx += 1
p2Idx += 1
} else {
val diff: Int = e1.compareTo(e2)
if (diff < 0) p1Idx += 1
else if (diff > 0) p2Idx += 1
}
}
}
longMatches > 0 && (shortMatches + longMatches) == Math.min(p1.length, p2.length)
}
def removeMatches(
graph_authors: java.util.List[String],
orcid_authors: java.util.List[String],
matchingFunc: java.util.function.BiFunction[String,String,Boolean]
) : java.util.List[String] = {
removeMatches(graph_authors, orcid_authors, (a, b) => matchingFunc(a,b))
}
def removeMatches(
graph_authors: java.util.List[String],
orcid_authors: java.util.List[String],
matchingFunc: (String, String) => Boolean
) : java.util.List[String] = {
val matched = new java.util.ArrayList[String]()
if (graph_authors != null && !graph_authors.isEmpty) {
val ait = graph_authors.iterator
while (ait.hasNext) {
val author = ait.next()
val oit = orcid_authors.iterator
breakable {
while (oit.hasNext) {
val orcid = oit.next()
if (matchingFunc(author, orcid)) {
ait.remove()
oit.remove()
matched.add(author)
matched.add(orcid)
break()
}
}
}
}
}
matched
}
}

View File

@ -67,8 +67,10 @@ public class BlockProcessor {
private void processRows(final List<Row> queue, final Reporter context) { private void processRows(final List<Row> queue, final Reporter context) {
for (int pivotPos = 0; pivotPos < queue.size(); pivotPos++) { IncrementalConnectedComponents icc = new IncrementalConnectedComponents(queue.size());
final Row pivot = queue.get(pivotPos);
for (int i = 0; i < queue.size(); i++) {
final Row pivot = queue.get(i);
final String idPivot = pivot.getString(identifierFieldPos); // identifier final String idPivot = pivot.getString(identifierFieldPos); // identifier
final Object fieldsPivot = getJavaValue(pivot, orderFieldPos); final Object fieldsPivot = getJavaValue(pivot, orderFieldPos);
@ -76,9 +78,9 @@ public class BlockProcessor {
final WfConfig wf = dedupConf.getWf(); final WfConfig wf = dedupConf.getWf();
if (fieldPivot != null) { if (fieldPivot != null) {
int i = 0; for (int j = icc.nextUnconnected(i, i + 1); j >= 0
for (int windowPos = pivotPos + 1; windowPos < queue.size(); windowPos++) { && j < queue.size(); j = icc.nextUnconnected(i, j + 1)) {
final Row curr = queue.get(windowPos); final Row curr = queue.get(j);
final String idCurr = curr.getString(identifierFieldPos); // identifier final String idCurr = curr.getString(identifierFieldPos); // identifier
if (mustSkip(idCurr)) { if (mustSkip(idCurr)) {
@ -86,7 +88,7 @@ public class BlockProcessor {
break; break;
} }
if (++i > wf.getSlidingWindowSize()) { if (wf.getSlidingWindowSize() > 0 && (j - i) > wf.getSlidingWindowSize()) {
break; break;
} }
@ -97,7 +99,9 @@ public class BlockProcessor {
final TreeProcessor treeProcessor = new TreeProcessor(dedupConf); final TreeProcessor treeProcessor = new TreeProcessor(dedupConf);
emitOutput(treeProcessor.compare(pivot, curr), idPivot, idCurr, context); if (emitOutput(treeProcessor.compare(pivot, curr), idPivot, idCurr, context)) {
icc.connect(i, j);
}
} }
} }
} }
@ -115,7 +119,8 @@ public class BlockProcessor {
return null; return null;
} }
private void emitOutput(final boolean result, final String idPivot, final String idCurr, final Reporter context) { private boolean emitOutput(final boolean result, final String idPivot, final String idCurr,
final Reporter context) {
if (result) { if (result) {
if (idPivot.compareTo(idCurr) <= 0) { if (idPivot.compareTo(idCurr) <= 0) {
@ -127,6 +132,8 @@ public class BlockProcessor {
} else { } else {
context.incrementCounter(dedupConf.getWf().getEntityType(), "d < " + dedupConf.getWf().getThreshold(), 1); context.incrementCounter(dedupConf.getWf().getEntityType(), "d < " + dedupConf.getWf().getThreshold(), 1);
} }
return result;
} }
private boolean mustSkip(final String idPivot) { private boolean mustSkip(final String idPivot) {
@ -142,5 +149,4 @@ public class BlockProcessor {
context.emit(type, from, to); context.emit(type, from, to);
} }
} }

View File

@ -0,0 +1,50 @@
package eu.dnetlib.pace.util;
import java.util.BitSet;
public class IncrementalConnectedComponents {
final private int size;
final private BitSet[] indexes;
IncrementalConnectedComponents(int size) {
this.size = size;
this.indexes = new BitSet[size];
}
public void connect(int i, int j) {
if (indexes[i] == null) {
if (indexes[j] == null) {
indexes[i] = new BitSet(size);
} else {
indexes[i] = indexes[j];
}
} else {
if (indexes[j] != null && indexes[i] != indexes[j]) {
// merge adjacency lists for i and j
indexes[i].or(indexes[j]);
}
}
indexes[i].set(i);
indexes[i].set(j);
indexes[j] = indexes[i];
}
public int nextUnconnected(int i, int j) {
if (indexes[i] == null) {
return j;
}
int result = indexes[i].nextClearBit(j);
return (result >= size) ? -1 : result;
}
public BitSet getConnections(int i) {
if (indexes[i] == null) {
return null;
}
return indexes[i];
}
}

View File

@ -97,6 +97,8 @@ public class MapDocumentUtil {
Object o = json.read(jsonPath); Object o = json.read(jsonPath);
if (o instanceof String) if (o instanceof String)
return (String) o; return (String) o;
if (o instanceof Number)
return (String) o.toString();
if (o instanceof JSONArray && ((JSONArray) o).size() > 0) if (o instanceof JSONArray && ((JSONArray) o).size() > 0)
return (String) ((JSONArray) o).get(0); return (String) ((JSONArray) o).get(0);
return ""; return "";

View File

@ -40,7 +40,7 @@ public class PaceResolver implements Serializable {
Collectors.toMap(cl -> cl.getAnnotation(ComparatorClass.class).value(), cl -> (Class<Comparator>) cl)); Collectors.toMap(cl -> cl.getAnnotation(ComparatorClass.class).value(), cl -> (Class<Comparator>) cl));
} }
public ClusteringFunction getClusteringFunction(String name, Map<String, Integer> params) throws PaceException { public ClusteringFunction getClusteringFunction(String name, Map<String, Object> params) throws PaceException {
try { try {
return clusteringFunctions.get(name).getDeclaredConstructor(Map.class).newInstance(params); return clusteringFunctions.get(name).getDeclaredConstructor(Map.class).newInstance(params);
} catch (InstantiationException | IllegalAccessException | InvocationTargetException } catch (InstantiationException | IllegalAccessException | InvocationTargetException

View File

@ -15,7 +15,7 @@ import eu.dnetlib.pace.config.DedupConfig;
public class ClusteringFunctionTest extends AbstractPaceTest { public class ClusteringFunctionTest extends AbstractPaceTest {
private static Map<String, Integer> params; private static Map<String, Object> params;
private static DedupConfig conf; private static DedupConfig conf;
@BeforeAll @BeforeAll
@ -40,10 +40,10 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test @Test
public void testNgram() { public void testNgram() {
params.put("ngramLen", 3); params.put("ngramLen", "3");
params.put("max", 8); params.put("max", "8");
params.put("maxPerToken", 2); params.put("maxPerToken", "2");
params.put("minNgramLen", 1); params.put("minNgramLen", "1");
final ClusteringFunction ngram = new Ngrams(params); final ClusteringFunction ngram = new Ngrams(params);
@ -54,8 +54,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test @Test
public void testNgramPairs() { public void testNgramPairs() {
params.put("ngramLen", 3); params.put("ngramLen", "3");
params.put("max", 2); params.put("max", "2");
final ClusteringFunction np = new NgramPairs(params); final ClusteringFunction np = new NgramPairs(params);
@ -66,8 +66,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test @Test
public void testSortedNgramPairs() { public void testSortedNgramPairs() {
params.put("ngramLen", 3); params.put("ngramLen", "3");
params.put("max", 2); params.put("max", "2");
final ClusteringFunction np = new SortedNgramPairs(params); final ClusteringFunction np = new SortedNgramPairs(params);
@ -87,9 +87,9 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test @Test
public void testAcronym() { public void testAcronym() {
params.put("max", 4); params.put("max", "4");
params.put("minLen", 1); params.put("minLen", "1");
params.put("maxLen", 3); params.put("maxLen", "3");
final ClusteringFunction acro = new Acronyms(params); final ClusteringFunction acro = new Acronyms(params);
@ -100,8 +100,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test @Test
public void testSuffixPrefix() { public void testSuffixPrefix() {
params.put("len", 3); params.put("len", "3");
params.put("max", 4); params.put("max", "4");
final ClusteringFunction sp = new SuffixPrefix(params); final ClusteringFunction sp = new SuffixPrefix(params);
@ -109,8 +109,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
System.out.println(s); System.out.println(s);
System.out.println(sp.apply(conf, Lists.newArrayList(s))); System.out.println(sp.apply(conf, Lists.newArrayList(s)));
params.put("len", 3); params.put("len", "3");
params.put("max", 1); params.put("max", "1");
System.out.println(sp.apply(conf, Lists.newArrayList("Framework for general-purpose deduplication"))); System.out.println(sp.apply(conf, Lists.newArrayList("Framework for general-purpose deduplication")));
} }
@ -118,8 +118,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test @Test
public void testWordsSuffixPrefix() { public void testWordsSuffixPrefix() {
params.put("len", 3); params.put("len", "3");
params.put("max", 4); params.put("max", "4");
final ClusteringFunction sp = new WordsSuffixPrefix(params); final ClusteringFunction sp = new WordsSuffixPrefix(params);
@ -130,7 +130,7 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test @Test
public void testWordsStatsSuffixPrefix() { public void testWordsStatsSuffixPrefix() {
params.put("mod", 10); params.put("mod", "10");
final ClusteringFunction sp = new WordsStatsSuffixPrefixChain(params); final ClusteringFunction sp = new WordsStatsSuffixPrefixChain(params);
@ -167,7 +167,7 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test @Test
public void testFieldValue() { public void testFieldValue() {
params.put("randomLength", 5); params.put("randomLength", "5");
final ClusteringFunction sp = new SpaceTrimmingFieldValue(params); final ClusteringFunction sp = new SpaceTrimmingFieldValue(params);

View File

@ -0,0 +1,40 @@
package eu.dnetlib.pace.util;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNull;
import org.junit.jupiter.api.Test;
public class IncrementalConnectedComponentsTest {
@Test
public void transitiveClosureTest() {
IncrementalConnectedComponents icc = new IncrementalConnectedComponents(10);
icc.connect(0, 1);
icc.connect(0, 2);
icc.connect(0, 3);
icc.connect(1, 2);
icc.connect(1, 4);
icc.connect(1, 5);
icc.connect(6, 7);
icc.connect(6, 9);
assertEquals(icc.getConnections(0).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(1).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(2).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(3).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(4).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(5).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(6).toString(), "{6, 7, 9}");
assertEquals(icc.getConnections(7).toString(), "{6, 7, 9}");
assertEquals(icc.getConnections(9).toString(), "{6, 7, 9}");
assertNull(icc.getConnections(8));
}
}

View File

@ -7,8 +7,7 @@ import java.util.function.BiFunction;
import eu.dnetlib.dhp.common.FunctionalInterfaceSupport.SerializableSupplier; import eu.dnetlib.dhp.common.FunctionalInterfaceSupport.SerializableSupplier;
import eu.dnetlib.dhp.schema.oaf.Oaf; import eu.dnetlib.dhp.schema.oaf.Oaf;
import eu.dnetlib.dhp.schema.oaf.OafEntity; import eu.dnetlib.dhp.schema.oaf.utils.MergeUtils;
import eu.dnetlib.dhp.schema.oaf.Relation;
/** OAF model merging support. */ /** OAF model merging support. */
public class MergeAndGet { public class MergeAndGet {
@ -46,20 +45,7 @@ public class MergeAndGet {
} }
private static <G extends Oaf, A extends Oaf> G mergeFromAndGet(G x, A y) { private static <G extends Oaf, A extends Oaf> G mergeFromAndGet(G x, A y) {
if (isSubClass(x, Relation.class) && isSubClass(y, Relation.class)) { return (G) MergeUtils.merge(x, y);
((Relation) x).mergeFrom((Relation) y);
return x;
} else if (isSubClass(x, OafEntity.class)
&& isSubClass(y, OafEntity.class)
&& isSubClass(x, y)) {
((OafEntity) x).mergeFrom((OafEntity) y);
return x;
}
throw new RuntimeException(
String
.format(
"MERGE_FROM_AND_GET incompatible types: %s, %s",
x.getClass().getCanonicalName(), y.getClass().getCanonicalName()));
} }
@SuppressWarnings("unchecked") @SuppressWarnings("unchecked")

View File

@ -0,0 +1,39 @@
/*
* Copyright (c) 2024.
* SPDX-FileCopyrightText: © 2023 Consiglio Nazionale delle Ricerche
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
package eu.dnetlib.dhp.actionmanager.promote;
/** Encodes the Actionset promotion strategies */
public class PromoteAction {
/** The supported actionset promotion strategies
*
* ENRICH: promotes only records in the actionset matching another record in the
* graph and enriches them applying the given MergeAndGet strategy
* UPSERT: promotes all the records in an actionset, matching records are updated
* using the given MergeAndGet strategy, the non-matching record as inserted as they are.
*/
public enum Strategy {
ENRICH, UPSERT
}
/**
* Returns the string representation of the join type implementing the given PromoteAction.
*
* @param strategy the strategy to be used to promote the Actionset contents
* @return the join type used to implement the promotion strategy
*/
public static String joinTypeForStrategy(PromoteAction.Strategy strategy) {
switch (strategy) {
case ENRICH:
return "left_outer";
case UPSERT:
return "full_outer";
default:
throw new IllegalStateException("unsupported PromoteAction: " + strategy.toString());
}
}
}

View File

@ -67,8 +67,9 @@ public class PromoteActionPayloadForGraphTableJob {
String outputGraphTablePath = parser.get("outputGraphTablePath"); String outputGraphTablePath = parser.get("outputGraphTablePath");
logger.info("outputGraphTablePath: {}", outputGraphTablePath); logger.info("outputGraphTablePath: {}", outputGraphTablePath);
MergeAndGet.Strategy strategy = MergeAndGet.Strategy.valueOf(parser.get("mergeAndGetStrategy").toUpperCase()); MergeAndGet.Strategy mergeAndGetStrategy = MergeAndGet.Strategy
logger.info("strategy: {}", strategy); .valueOf(parser.get("mergeAndGetStrategy").toUpperCase());
logger.info("mergeAndGetStrategy: {}", mergeAndGetStrategy);
Boolean shouldGroupById = Optional Boolean shouldGroupById = Optional
.ofNullable(parser.get("shouldGroupById")) .ofNullable(parser.get("shouldGroupById"))
@ -76,6 +77,12 @@ public class PromoteActionPayloadForGraphTableJob {
.orElse(true); .orElse(true);
logger.info("shouldGroupById: {}", shouldGroupById); logger.info("shouldGroupById: {}", shouldGroupById);
PromoteAction.Strategy promoteActionStrategy = Optional
.ofNullable(parser.get("promoteActionStrategy"))
.map(PromoteAction.Strategy::valueOf)
.orElse(PromoteAction.Strategy.UPSERT);
logger.info("promoteActionStrategy: {}", promoteActionStrategy);
@SuppressWarnings("unchecked") @SuppressWarnings("unchecked")
Class<? extends Oaf> rowClazz = (Class<? extends Oaf>) Class.forName(graphTableClassName); Class<? extends Oaf> rowClazz = (Class<? extends Oaf>) Class.forName(graphTableClassName);
@SuppressWarnings("unchecked") @SuppressWarnings("unchecked")
@ -97,7 +104,8 @@ public class PromoteActionPayloadForGraphTableJob {
inputGraphTablePath, inputGraphTablePath,
inputActionPayloadPath, inputActionPayloadPath,
outputGraphTablePath, outputGraphTablePath,
strategy, mergeAndGetStrategy,
promoteActionStrategy,
rowClazz, rowClazz,
actionPayloadClazz, actionPayloadClazz,
shouldGroupById); shouldGroupById);
@ -124,14 +132,16 @@ public class PromoteActionPayloadForGraphTableJob {
String inputGraphTablePath, String inputGraphTablePath,
String inputActionPayloadPath, String inputActionPayloadPath,
String outputGraphTablePath, String outputGraphTablePath,
MergeAndGet.Strategy strategy, MergeAndGet.Strategy mergeAndGetStrategy,
PromoteAction.Strategy promoteActionStrategy,
Class<G> rowClazz, Class<G> rowClazz,
Class<A> actionPayloadClazz, Boolean shouldGroupById) { Class<A> actionPayloadClazz, Boolean shouldGroupById) {
Dataset<G> rowDS = readGraphTable(spark, inputGraphTablePath, rowClazz); Dataset<G> rowDS = readGraphTable(spark, inputGraphTablePath, rowClazz);
Dataset<A> actionPayloadDS = readActionPayload(spark, inputActionPayloadPath, actionPayloadClazz); Dataset<A> actionPayloadDS = readActionPayload(spark, inputActionPayloadPath, actionPayloadClazz);
Dataset<G> result = promoteActionPayloadForGraphTable( Dataset<G> result = promoteActionPayloadForGraphTable(
rowDS, actionPayloadDS, strategy, rowClazz, actionPayloadClazz, shouldGroupById) rowDS, actionPayloadDS, mergeAndGetStrategy, promoteActionStrategy, rowClazz, actionPayloadClazz,
shouldGroupById)
.map((MapFunction<G, G>) value -> value, Encoders.bean(rowClazz)); .map((MapFunction<G, G>) value -> value, Encoders.bean(rowClazz));
saveGraphTable(result, outputGraphTablePath); saveGraphTable(result, outputGraphTablePath);
@ -183,7 +193,8 @@ public class PromoteActionPayloadForGraphTableJob {
private static <G extends Oaf, A extends Oaf> Dataset<G> promoteActionPayloadForGraphTable( private static <G extends Oaf, A extends Oaf> Dataset<G> promoteActionPayloadForGraphTable(
Dataset<G> rowDS, Dataset<G> rowDS,
Dataset<A> actionPayloadDS, Dataset<A> actionPayloadDS,
MergeAndGet.Strategy strategy, MergeAndGet.Strategy mergeAndGetStrategy,
PromoteAction.Strategy promoteActionStrategy,
Class<G> rowClazz, Class<G> rowClazz,
Class<A> actionPayloadClazz, Class<A> actionPayloadClazz,
Boolean shouldGroupById) { Boolean shouldGroupById) {
@ -195,8 +206,9 @@ public class PromoteActionPayloadForGraphTableJob {
SerializableSupplier<Function<G, String>> rowIdFn = ModelSupport::idFn; SerializableSupplier<Function<G, String>> rowIdFn = ModelSupport::idFn;
SerializableSupplier<Function<A, String>> actionPayloadIdFn = ModelSupport::idFn; SerializableSupplier<Function<A, String>> actionPayloadIdFn = ModelSupport::idFn;
SerializableSupplier<BiFunction<G, A, G>> mergeRowWithActionPayloadAndGetFn = MergeAndGet.functionFor(strategy); SerializableSupplier<BiFunction<G, A, G>> mergeRowWithActionPayloadAndGetFn = MergeAndGet
SerializableSupplier<BiFunction<G, G, G>> mergeRowsAndGetFn = MergeAndGet.functionFor(strategy); .functionFor(mergeAndGetStrategy);
SerializableSupplier<BiFunction<G, G, G>> mergeRowsAndGetFn = MergeAndGet.functionFor(mergeAndGetStrategy);
SerializableSupplier<G> zeroFn = zeroFn(rowClazz); SerializableSupplier<G> zeroFn = zeroFn(rowClazz);
SerializableSupplier<Function<G, Boolean>> isNotZeroFn = PromoteActionPayloadForGraphTableJob::isNotZeroFnUsingIdOrSourceAndTarget; SerializableSupplier<Function<G, Boolean>> isNotZeroFn = PromoteActionPayloadForGraphTableJob::isNotZeroFnUsingIdOrSourceAndTarget;
@ -207,6 +219,7 @@ public class PromoteActionPayloadForGraphTableJob {
rowIdFn, rowIdFn,
actionPayloadIdFn, actionPayloadIdFn,
mergeRowWithActionPayloadAndGetFn, mergeRowWithActionPayloadAndGetFn,
promoteActionStrategy,
rowClazz, rowClazz,
actionPayloadClazz); actionPayloadClazz);

View File

@ -34,6 +34,7 @@ public class PromoteActionPayloadFunctions {
* @param rowIdFn Function used to get the id of graph table row * @param rowIdFn Function used to get the id of graph table row
* @param actionPayloadIdFn Function used to get id of action payload instance * @param actionPayloadIdFn Function used to get id of action payload instance
* @param mergeAndGetFn Function used to merge graph table row and action payload instance * @param mergeAndGetFn Function used to merge graph table row and action payload instance
* @param promoteActionStrategy the Actionset promotion strategy
* @param rowClazz Class of graph table * @param rowClazz Class of graph table
* @param actionPayloadClazz Class of action payload * @param actionPayloadClazz Class of action payload
* @param <G> Type of graph table row * @param <G> Type of graph table row
@ -46,6 +47,7 @@ public class PromoteActionPayloadFunctions {
SerializableSupplier<Function<G, String>> rowIdFn, SerializableSupplier<Function<G, String>> rowIdFn,
SerializableSupplier<Function<A, String>> actionPayloadIdFn, SerializableSupplier<Function<A, String>> actionPayloadIdFn,
SerializableSupplier<BiFunction<G, A, G>> mergeAndGetFn, SerializableSupplier<BiFunction<G, A, G>> mergeAndGetFn,
PromoteAction.Strategy promoteActionStrategy,
Class<G> rowClazz, Class<G> rowClazz,
Class<A> actionPayloadClazz) { Class<A> actionPayloadClazz) {
if (!isSubClass(rowClazz, actionPayloadClazz)) { if (!isSubClass(rowClazz, actionPayloadClazz)) {
@ -61,7 +63,7 @@ public class PromoteActionPayloadFunctions {
.joinWith( .joinWith(
actionPayloadWithIdDS, actionPayloadWithIdDS,
rowWithIdDS.col("_1").equalTo(actionPayloadWithIdDS.col("_1")), rowWithIdDS.col("_1").equalTo(actionPayloadWithIdDS.col("_1")),
"full_outer") PromoteAction.joinTypeForStrategy(promoteActionStrategy))
.map( .map(
(MapFunction<Tuple2<Tuple2<String, G>, Tuple2<String, A>>, G>) value -> { (MapFunction<Tuple2<Tuple2<String, G>, Tuple2<String, A>>, G>) value -> {
Optional<G> rowOpt = Optional.ofNullable(value._1()).map(Tuple2::_2); Optional<G> rowOpt = Optional.ofNullable(value._1()).map(Tuple2::_2);

View File

@ -41,6 +41,12 @@
"paramDescription": "strategy for merging graph table objects with action payload instances, MERGE_FROM_AND_GET or SELECT_NEWER_AND_GET", "paramDescription": "strategy for merging graph table objects with action payload instances, MERGE_FROM_AND_GET or SELECT_NEWER_AND_GET",
"paramRequired": true "paramRequired": true
}, },
{
"paramName": "pas",
"paramLongName": "promoteActionStrategy",
"paramDescription": "strategy for promoting the actionset contents into the graph tables, ENRICH or UPSERT (default)",
"paramRequired": false
},
{ {
"paramName": "sgid", "paramName": "sgid",
"paramLongName": "shouldGroupById", "paramLongName": "shouldGroupById",

Some files were not shown because too many files have changed in this diff Show More