forked from D-Net/dnet-hadoop
merging with branch beta
commit e9131f4e4a
@ -26,3 +26,4 @@ spark-warehouse
|
|||
/**/*.log
|
||||
/**/.factorypath
|
||||
/**/.scalafmt.conf
|
||||
/.java-version
|
||||
|
|
|
@ -0,0 +1,43 @@
|
|||
# Contributor Code of Conduct
|
||||
|
||||
Openness, transparency and our community-driven participatory approach guide us in our day-to-day interactions and decision-making. Our open source projects are no exception. Trust, respect, collaboration and transparency are core values we believe should live and breathe within our projects. Our community welcomes participants from around the world with different experiences, unique perspectives, and great ideas to share.
|
||||
|
||||
## Our Pledge
|
||||
|
||||
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
|
||||
|
||||
## Our Standards
|
||||
|
||||
Examples of behavior that contributes to creating a positive environment include:
|
||||
|
||||
- Using welcoming and inclusive language
|
||||
- Being respectful of differing viewpoints and experiences
|
||||
- Gracefully accepting constructive criticism
|
||||
- Attempting collaboration before conflict
|
||||
- Focusing on what is best for the community
|
||||
- Showing empathy towards other community members
|
||||
|
||||
Examples of unacceptable behavior by participants include:
|
||||
|
||||
- Violence, threats of violence, or inciting others to commit self-harm
|
||||
- The use of sexualized language or imagery and unwelcome sexual attention or advances
|
||||
- Trolling, intentionally spreading misinformation, insulting/derogatory comments, and personal or political attacks
|
||||
- Public or private harassment
|
||||
- Publishing others' private information, such as a physical or electronic address, without explicit permission
|
||||
- Abuse of the reporting process to intentionally harass or exclude others
|
||||
- Advocating for, or encouraging, any of the above behavior
|
||||
- Other conduct which could reasonably be considered inappropriate in a professional setting
|
||||
|
||||
## Our Responsibilities
|
||||
|
||||
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
|
||||
|
||||
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
|
||||
|
||||
## Scope
|
||||
|
||||
This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
|
||||
|
||||
## Attribution
|
||||
|
||||
This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/), [version 1.4](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html).
|
|
@ -0,0 +1,10 @@
|
|||
# Contributing to D-Net Hadoop
|
||||
|
||||
:+1::tada: First off, thanks for taking the time to contribute! :tada::+1:
|
||||
|
||||
This project and everyone participating in it is governed by our [Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
|
||||
|
||||
The following is a set of guidelines for contributing to this project and its packages. These are mostly guidelines, not rules, and they apply to this project as a whole, including all its sub-modules.
|
||||
Use your best judgment, and feel free to propose changes to this document in a pull request.
|
||||
|
||||
All contributions are welcome, and all of them will be considered to be contributed under the [project license](LICENSE.md).
|
133
README.md
|
@ -1,2 +1,133 @@
|
|||
# dnet-hadoop
|
||||
Dnet-hadoop is the project that defined all the OOZIE workflows for the OpenAIRE Graph construction, processing, provisioning.
|
||||
|
||||
Dnet-hadoop is the project that defines all the [OOZIE workflows](https://oozie.apache.org/) for the construction, processing, and provisioning of the OpenAIRE Graph.
|
||||
|
||||
This project adheres to the Contributor Covenant [code of conduct](CODE_OF_CONDUCT.md).
|
||||
By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
|
||||
|
||||
This project is licensed under the [AGPL v3 or later version](LICENSE.md).
|
||||
|
||||
How to build, package and run oozie workflows
|
||||
====================
|
||||
|
||||
Oozie-installer is a utility for building, uploading and running Oozie workflows. In practice, it creates a `*.tar.gz` package containing the resources that define a workflow, plus some helper scripts.
|
||||
|
||||
This module is automatically executed when running:
|
||||
|
||||
`mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app`
|
||||
|
||||
on module having set:
|
||||
|
||||
```
|
||||
<parent>
|
||||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
<artifactId>dhp-workflows</artifactId>
|
||||
</parent>
|
||||
```
|
||||
|
||||
in its `pom.xml` file. The `oozie-package` profile initializes Oozie workflow packaging, and the `workflow.source.dir` property points to a workflow (note: this is not a relative path but a classpath to the directory that usually holds the `oozie_app` subdirectory).
|
||||
|
||||
The outcome of this packaging is an `oozie-package.tar.gz` file containing all the resources required to run the Oozie workflow:
|
||||
|
||||
- jar packages
|
||||
- workflow definitions
|
||||
- job properties
|
||||
- maintenance scripts
|
||||
|
||||
Required properties
|
||||
====================
|
||||
|
||||
In order to include the proper workflow within the package, the `workflow.source.dir` property has to be set. It can be provided by setting the `-Dworkflow.source.dir=some/job/dir` maven parameter.
|
||||
|
||||
In order to define the full set of cluster environment properties, one should create a `~/.dhp/application.properties` file with the following properties:
|
||||
|
||||
- `dhp.hadoop.frontend.user.name` - your user name on hadoop cluster and frontend machine
|
||||
- `dhp.hadoop.frontend.host.name` - frontend host name
|
||||
- `dhp.hadoop.frontend.temp.dir` - frontend directory for temporary files
|
||||
- `dhp.hadoop.frontend.port.ssh` - frontend machine ssh port
|
||||
- `oozieServiceLoc` - oozie service location required by run_workflow.sh script executing oozie job
|
||||
- `nameNode` - name node address
|
||||
- `jobTracker` - job tracker address
|
||||
- `oozie.execution.log.file.location` - location of the file that will be created when executing an oozie job; it contains the output produced by the `run_workflow.sh` script (needed to obtain the oozie job id)
|
||||
- `maven.executable` - mvn command location, requires parameterization due to a different setup of CI cluster
|
||||
- `sparkDriverMemory` - amount of memory assigned to spark jobs driver
|
||||
- `sparkExecutorMemory` - amount of memory assigned to spark jobs executors
|
||||
- `sparkExecutorCores` - number of cores assigned to spark jobs executors
|
||||
|
||||
All values will be overridden by the ones from `job.properties` and, if present, `job-override.properties` stored in the module's main folder.
|
||||
|
||||
To override properties from `job.properties`, a `job-override.properties` file can be created in the main module directory (the one containing the `pom.xml` file), defining the new values that will override the existing properties.
Alternatively, those properties can be provided one by one as command line `-D` arguments.
|
||||
|
||||
Properties overriding order is the following:
|
||||
|
||||
1. `pom.xml` defined properties (located in the project root dir)
|
||||
2. `~/.dhp/application.properties` defined properties
|
||||
3. `${workflow.source.dir}/job.properties`
|
||||
4. `job-override.properties` (located in the project root dir)
|
||||
5. `maven -Dparam=value`
|
||||
|
||||
where the maven `-Dparam` property is overriding all the other ones.
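
The override order above can be pictured as a stack of property layers in which later layers win. The following is a minimal sketch of that idea using plain `java.util.Properties`. The class `PropertyLayers`, its methods, and the sample values are hypothetical names chosen for illustration; they are not part of the oozie-installer, whose actual implementation may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical illustration only: this is not the oozie-installer code.
final class PropertyLayers {

    // Builds one layer from alternating key/value arguments.
    static Properties layer(String... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("expected key/value pairs");
        }
        Properties p = new Properties();
        for (int i = 0; i < keyValues.length; i += 2) {
            p.setProperty(keyValues[i], keyValues[i + 1]);
        }
        return p;
    }

    // Merges layers listed from lowest to highest priority: a later layer
    // replaces any value an earlier layer defined for the same key.
    static Properties merge(List<Properties> lowestToHighest) {
        Properties merged = new Properties();
        for (Properties layer : lowestToHighest) {
            merged.putAll(layer);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Sample values are made up; only the ordering mirrors the list above.
        Properties merged = merge(Arrays.asList(
            layer("sparkDriverMemory", "1G", "sparkExecutorCores", "1"), // 1. pom.xml
            layer("sparkDriverMemory", "2G"),                            // 2. ~/.dhp/application.properties
            layer("sparkExecutorMemory", "4G"),                          // 3. job.properties
            layer("sparkExecutorMemory", "6G"),                          // 4. job-override.properties
            layer("sparkDriverMemory", "8G")));                          // 5. -Dparam=value

        System.out.println(merged.getProperty("sparkDriverMemory"));   // prints 8G
        System.out.println(merged.getProperty("sparkExecutorMemory")); // prints 6G
        System.out.println(merged.getProperty("sparkExecutorCores"));  // prints 1
    }
}
```

A key that appears only in a low-priority layer (here `sparkExecutorCores`) survives the merge untouched, which is why defaults can live in `pom.xml` while individual runs override just the values they need.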
|
||||
|
||||
Workflow definition requirements
|
||||
====================
|
||||
|
||||
`workflow.source.dir` property should point to the following directory structure:
|
||||
|
||||
    [${workflow.source.dir}]
      |
      |-job.properties (optional)
      |
      \-[oozie_app]
        |
        \-workflow.xml
|
||||
|
||||
This property can be set using maven `-D` switch.
|
||||
|
||||
`[oozie_app]` is the default directory name; however, it can be set to any value as long as the `oozieAppDir` property is provided with the directory name as its value.
|
||||
|
||||
Sub-workflows are supported as well and sub-workflow directories should be nested within `[oozie_app]` directory.
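
The layout requirement described above can be expressed as a small check. The sketch below assumes the classpath location behind `workflow.source.dir` has already been resolved to a directory on disk, which is an assumption made for illustration. `WorkflowLayoutCheck` and its methods are hypothetical names, not part of the project.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only: not part of the oozie-installer.
final class WorkflowLayoutCheck {

    // Returns true when <workflowSourceDir>/<oozieAppDir>/workflow.xml exists.
    // job.properties is optional, so it is deliberately not checked.
    static boolean hasExpectedLayout(Path workflowSourceDir, String oozieAppDir) {
        Path appDir = workflowSourceDir.resolve(oozieAppDir);
        return Files.isDirectory(appDir)
            && Files.isRegularFile(appDir.resolve("workflow.xml"));
    }

    // Creates a throwaway directory with the expected layout, for demonstration.
    static Path createSampleLayout(String oozieAppDir) {
        try {
            Path root = Files.createTempDirectory("workflow-source-dir");
            Path appDir = Files.createDirectories(root.resolve(oozieAppDir));
            Files.createFile(appDir.resolve("workflow.xml"));
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path sample = createSampleLayout("oozie_app");
        System.out.println(hasExpectedLayout(sample, "oozie_app")); // prints true
        System.out.println(hasExpectedLayout(sample, "other_app")); // prints false
    }
}
```

Passing the application directory name as a parameter mirrors the role of the `oozieAppDir` property: the default `oozie_app` is only a convention.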
|
||||
|
||||
Creating oozie installer step-by-step
|
||||
=====================================
|
||||
|
||||
Automated oozie-installer steps are the following:
|
||||
|
||||
1. creating jar packages: `*.jar` and `*tests.jar` along with copying all dependencies in `target/dependencies`
|
||||
2. reading properties from maven, `~/.dhp/application.properties`, `job.properties`, `job-override.properties`
|
||||
3. invoking priming mechanism linking resources from import.txt file (currently resolving subworkflow resources)
|
||||
4. assembling shell scripts for preparing Hadoop filesystem, uploading Oozie application and starting workflow
|
||||
5. copying whole `${workflow.source.dir}` content to `target/${oozie.package.file.name}`
|
||||
6. generating updated `job.properties` file in `target/${oozie.package.file.name}` based on maven, `~/.dhp/application.properties`, `job.properties` and `job-override.properties`
7. creating `lib` directory (or multiple directories for sub-workflows for each nested directory) and copying jar packages created at step (1) to each one of them
|
||||
8. bundling whole `${oozie.package.file.name}` directory into single tar.gz package
|
||||
|
||||
Uploading oozie package and running workflow on cluster
|
||||
=======================================================
|
||||
|
||||
In order to simplify the deployment and execution process, two dedicated profiles were introduced:
|
||||
|
||||
- `deploy`
|
||||
- `run`
|
||||
|
||||
to be used along with `oozie-package` profile e.g. by providing `-Poozie-package,deploy,run` maven parameters.
|
||||
|
||||
The `deploy` profile supplements packaging process with:
|
||||
1) uploading oozie-package via scp to `/home/${user.name}/oozie-packages` directory on `${dhp.hadoop.frontend.host.name}` machine
|
||||
2) extracting uploaded package
|
||||
3) uploading oozie content to hadoop cluster HDFS location defined in `oozie.wf.application.path` property (generated dynamically by maven build process, based on `${dhp.hadoop.frontend.user.name}` and `workflow.source.dir` properties)
|
||||
|
||||
The `run` profile introduces:
|
||||
1) executing the oozie application uploaded to the HDFS cluster by the `deploy` profile. It triggers the `run_workflow.sh` script, providing the runtime properties defined in the `job.properties` file.
|
||||
|
||||
Notice: ssh access to the frontend machine has to be configured at system level, and it is preferable to set up key-based authentication in order to simplify remote operations.
|
|
@ -52,6 +52,8 @@
|
|||
</execution>
|
||||
</executions>
|
||||
<configuration>
|
||||
<failOnMultipleScalaVersions>true</failOnMultipleScalaVersions>
|
||||
<scalaCompatVersion>${scala.binary.version}</scalaCompatVersion>
|
||||
<scalaVersion>${scala.version}</scalaVersion>
|
||||
</configuration>
|
||||
</plugin>
|
||||
|
@ -60,6 +62,11 @@
|
|||
</build>
|
||||
|
||||
<dependencies>
|
||||
<dependency>
|
||||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
<artifactId>dhp-pace-core</artifactId>
|
||||
<version>${project.version}</version>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>org.apache.hadoop</groupId>
|
||||
|
@ -76,11 +83,11 @@
|
|||
|
||||
<dependency>
|
||||
<groupId>org.apache.spark</groupId>
|
||||
<artifactId>spark-core_2.11</artifactId>
|
||||
<artifactId>spark-core_${scala.binary.version}</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>org.apache.spark</groupId>
|
||||
<artifactId>spark-sql_2.11</artifactId>
|
||||
<artifactId>spark-sql_${scala.binary.version}</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
|
@ -142,11 +149,6 @@
|
|||
<artifactId>okhttp</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>eu.dnetlib</groupId>
|
||||
<artifactId>dnet-pace-core</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>org.apache.httpcomponents</groupId>
|
||||
<artifactId>httpclient</artifactId>
|
||||
|
@ -159,7 +161,7 @@
|
|||
|
||||
<dependency>
|
||||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
<artifactId>dhp-schemas</artifactId>
|
||||
<artifactId>${dhp-schemas.artifact}</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
|
|
|
@ -51,6 +51,7 @@ public class Constants {
|
|||
public static final String RETRY_DELAY = "retryDelay";
|
||||
public static final String CONNECT_TIMEOUT = "connectTimeOut";
|
||||
public static final String READ_TIMEOUT = "readTimeOut";
|
||||
public static final String REQUEST_METHOD = "requestMethod";
|
||||
public static final String FROM_DATE_OVERRIDE = "fromDateOverride";
|
||||
public static final String UNTIL_DATE_OVERRIDE = "untilDateOverride";
|
||||
|
||||
|
|
|
@ -0,0 +1,39 @@
|
|||
|
||||
package eu.dnetlib.dhp.common.api.context;
|
||||
|
||||
public class CategorySummary {
|
||||
|
||||
private String id;
|
||||
|
||||
private String label;
|
||||
|
||||
private boolean hasConcept;
|
||||
|
||||
public String getId() {
|
||||
return id;
|
||||
}
|
||||
|
||||
public String getLabel() {
|
||||
return label;
|
||||
}
|
||||
|
||||
public boolean isHasConcept() {
|
||||
return hasConcept;
|
||||
}
|
||||
|
||||
public CategorySummary setId(final String id) {
|
||||
this.id = id;
|
||||
return this;
|
||||
}
|
||||
|
||||
public CategorySummary setLabel(final String label) {
|
||||
this.label = label;
|
||||
return this;
|
||||
}
|
||||
|
||||
public CategorySummary setHasConcept(final boolean hasConcept) {
|
||||
this.hasConcept = hasConcept;
|
||||
return this;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,7 @@
|
|||
|
||||
package eu.dnetlib.dhp.common.api.context;
|
||||
|
||||
import java.util.ArrayList;
|
||||
|
||||
public class CategorySummaryList extends ArrayList<CategorySummary> {
|
||||
}
|
|
@ -0,0 +1,52 @@
|
|||
|
||||
package eu.dnetlib.dhp.common.api.context;
|
||||
|
||||
import java.util.List;
|
||||
|
||||
public class ConceptSummary {
|
||||
|
||||
private String id;
|
||||
|
||||
private String label;
|
||||
|
||||
private boolean hasSubConcept;
|
||||
|
||||
private List<ConceptSummary> concepts;
|
||||
|
||||
public String getId() {
|
||||
return id;
|
||||
}
|
||||
|
||||
public String getLabel() {
|
||||
return label;
|
||||
}
|
||||
|
||||
public List<ConceptSummary> getConcepts() {
|
||||
return concepts;
|
||||
}
|
||||
|
||||
public ConceptSummary setId(final String id) {
|
||||
this.id = id;
|
||||
return this;
|
||||
}
|
||||
|
||||
public ConceptSummary setLabel(final String label) {
|
||||
this.label = label;
|
||||
return this;
|
||||
}
|
||||
|
||||
public boolean isHasSubConcept() {
|
||||
return hasSubConcept;
|
||||
}
|
||||
|
||||
public ConceptSummary setHasSubConcept(final boolean hasSubConcept) {
|
||||
this.hasSubConcept = hasSubConcept;
|
||||
return this;
|
||||
}
|
||||
|
||||
public ConceptSummary setConcept(final List<ConceptSummary> concepts) {
|
||||
this.concepts = concepts;
|
||||
return this;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,7 @@
|
|||
|
||||
package eu.dnetlib.dhp.common.api.context;
|
||||
|
||||
import java.util.ArrayList;
|
||||
|
||||
public class ConceptSummaryList extends ArrayList<ConceptSummary> {
|
||||
}
|
|
@ -0,0 +1,50 @@
|
|||
|
||||
package eu.dnetlib.dhp.common.api.context;
|
||||
|
||||
public class ContextSummary {
|
||||
|
||||
private String id;
|
||||
|
||||
private String label;
|
||||
|
||||
private String type;
|
||||
|
||||
private String status;
|
||||
|
||||
public String getId() {
|
||||
return id;
|
||||
}
|
||||
|
||||
public String getLabel() {
|
||||
return label;
|
||||
}
|
||||
|
||||
public String getType() {
|
||||
return type;
|
||||
}
|
||||
|
||||
public String getStatus() {
|
||||
return status;
|
||||
}
|
||||
|
||||
public ContextSummary setId(final String id) {
|
||||
this.id = id;
|
||||
return this;
|
||||
}
|
||||
|
||||
public ContextSummary setLabel(final String label) {
|
||||
this.label = label;
|
||||
return this;
|
||||
}
|
||||
|
||||
public ContextSummary setType(final String type) {
|
||||
this.type = type;
|
||||
return this;
|
||||
}
|
||||
|
||||
public ContextSummary setStatus(final String status) {
|
||||
this.status = status;
|
||||
return this;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,7 @@
|
|||
|
||||
package eu.dnetlib.dhp.common.api.context;
|
||||
|
||||
import java.util.ArrayList;
|
||||
|
||||
public class ContextSummaryList extends ArrayList<ContextSummary> {
|
||||
}
|
|
@ -1,6 +1,9 @@
|
|||
|
||||
package eu.dnetlib.dhp.common.collection;
|
||||
|
||||
import java.util.HashMap;
|
||||
import java.util.Map;
|
||||
|
||||
/**
|
||||
* Bundles the http connection parameters driving the client behaviour.
|
||||
*/
|
||||
|
@ -13,6 +16,8 @@ public class HttpClientParams {
|
|||
public static int _connectTimeOut = 10; // seconds
|
||||
public static int _readTimeOut = 30; // seconds
|
||||
|
||||
public static String _requestMethod = "GET";
|
||||
|
||||
/**
|
||||
* Maximum number of allowed retries before failing
|
||||
*/
|
||||
|
@ -38,17 +43,30 @@ public class HttpClientParams {
|
|||
*/
|
||||
private int readTimeOut;
|
||||
|
||||
/**
|
||||
* Custom http headers
|
||||
*/
|
||||
private Map<String, String> headers;
|
||||
|
||||
/**
|
||||
* Request method (i.e., GET, POST etc)
|
||||
*/
|
||||
private String requestMethod;
|
||||
|
||||
public HttpClientParams() {
|
||||
this(_maxNumberOfRetry, _requestDelay, _retryDelay, _connectTimeOut, _readTimeOut);
|
||||
this(_maxNumberOfRetry, _requestDelay, _retryDelay, _connectTimeOut, _readTimeOut, new HashMap<>(),
|
||||
_requestMethod);
|
||||
}
|
||||
|
||||
public HttpClientParams(int maxNumberOfRetry, int requestDelay, int retryDelay, int connectTimeOut,
|
||||
int readTimeOut) {
|
||||
int readTimeOut, Map<String, String> headers, String requestMethod) {
|
||||
this.maxNumberOfRetry = maxNumberOfRetry;
|
||||
this.requestDelay = requestDelay;
|
||||
this.retryDelay = retryDelay;
|
||||
this.connectTimeOut = connectTimeOut;
|
||||
this.readTimeOut = readTimeOut;
|
||||
this.headers = headers;
|
||||
this.requestMethod = requestMethod;
|
||||
}
|
||||
|
||||
public int getMaxNumberOfRetry() {
|
||||
|
@ -91,4 +109,19 @@ public class HttpClientParams {
|
|||
this.readTimeOut = readTimeOut;
|
||||
}
|
||||
|
||||
public Map<String, String> getHeaders() {
|
||||
return headers;
|
||||
}
|
||||
|
||||
public void setHeaders(Map<String, String> headers) {
|
||||
this.headers = headers;
|
||||
}
|
||||
|
||||
public String getRequestMethod() {
|
||||
return requestMethod;
|
||||
}
|
||||
|
||||
public void setRequestMethod(String requestMethod) {
|
||||
this.requestMethod = requestMethod;
|
||||
}
|
||||
}
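
How a caller might apply a request method and a map of custom headers, such as the two fields added to `HttpClientParams`, to a connection can be sketched with JDK classes alone. `RequestSetup` and `prepare` are hypothetical names, and the URL and header values are placeholders. The sketch only configures the connection object and performs no network I/O.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not the project's HttpConnector2.
final class RequestSetup {

    // Opens (but does not connect) an HttpURLConnection and applies the
    // request method and every custom header to it.
    static HttpURLConnection prepare(String url, String requestMethod, Map<String, String> headers) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod(requestMethod);
            for (Map.Entry<String, String> header : headers.entrySet()) {
                conn.addRequestProperty(header.getKey(), header.getValue());
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "application/json"); // placeholder header

        HttpURLConnection conn = prepare("http://example.org/api", "POST", headers);
        System.out.println(conn.getRequestMethod());             // prints POST
        System.out.println(conn.getRequestProperty("Accept"));   // prints application/json
    }
}
```

Defaulting the header map to an empty map, as the no-argument constructor above does with `new HashMap<>()`, lets code like this iterate without a null check.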
|
||||
|
|
|
@ -8,10 +8,13 @@ import java.io.InputStream;
|
|||
import java.net.*;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.concurrent.TimeUnit;
|
||||
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.commons.lang3.math.NumberUtils;
|
||||
import org.apache.commons.lang3.time.DateUtils;
|
||||
import org.apache.http.HttpHeaders;
|
||||
import org.joda.time.Instant;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
|
@ -94,23 +97,31 @@ public class HttpConnector2 {
|
|||
throw new CollectorException(msg);
|
||||
}
|
||||
|
||||
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
|
||||
|
||||
InputStream input = null;
|
||||
|
||||
long start = System.currentTimeMillis();
|
||||
try {
|
||||
if (getClientParams().getRequestDelay() > 0) {
|
||||
backoffAndSleep(getClientParams().getRequestDelay());
|
||||
}
|
||||
|
||||
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
|
||||
|
||||
final HttpURLConnection urlConn = (HttpURLConnection) new URL(requestUrl).openConnection();
|
||||
urlConn.setInstanceFollowRedirects(false);
|
||||
urlConn.setReadTimeout(getClientParams().getReadTimeOut() * 1000);
|
||||
urlConn.setConnectTimeout(getClientParams().getConnectTimeOut() * 1000);
|
||||
urlConn.addRequestProperty(HttpHeaders.USER_AGENT, userAgent);
|
||||
urlConn.setRequestMethod(getClientParams().getRequestMethod());
|
||||
|
||||
if (log.isDebugEnabled()) {
|
||||
logHeaderFields(urlConn);
|
||||
// if provided, add custom headers
|
||||
if (!getClientParams().getHeaders().isEmpty()) {
|
||||
for (Map.Entry<String, String> headerEntry : getClientParams().getHeaders().entrySet()) {
|
||||
urlConn.addRequestProperty(headerEntry.getKey(), headerEntry.getValue());
|
||||
}
|
||||
}
|
||||
|
||||
logHeaderFields(urlConn);
|
||||
|
||||
int retryAfter = obtainRetryAfter(urlConn.getHeaderFields());
|
||||
String rateLimit = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT);
|
||||
|
@ -125,9 +136,7 @@ public class HttpConnector2 {
|
|||
}
|
||||
|
||||
if (is2xx(urlConn.getResponseCode())) {
|
||||
input = urlConn.getInputStream();
|
||||
responseType = urlConn.getContentType();
|
||||
return input;
|
||||
return getInputStream(urlConn, start);
|
||||
}
|
||||
if (is3xx(urlConn.getResponseCode())) {
|
||||
// REDIRECTS
|
||||
|
@ -137,6 +146,7 @@ public class HttpConnector2 {
|
|||
.put(
|
||||
REPORT_PREFIX + urlConn.getResponseCode(),
|
||||
String.format("Moved to: %s", newUrl));
|
||||
logRequestTime(start);
|
||||
urlConn.disconnect();
|
||||
if (retryAfter > 0) {
|
||||
backoffAndSleep(retryAfter);
|
||||
|
@ -152,26 +162,50 @@ public class HttpConnector2 {
|
|||
if (retryAfter > 0) {
|
||||
log
|
||||
.warn(
|
||||
"{} - waiting and repeating request after suggested retry-after {} sec.",
|
||||
requestUrl, retryAfter);
|
||||
"waiting and repeating request after suggested retry-after {} sec for URL {}",
|
||||
retryAfter, requestUrl);
|
||||
backoffAndSleep(retryAfter * 1000);
|
||||
} else {
|
||||
log
|
||||
.warn(
|
||||
"{} - waiting and repeating request after default delay of {} sec.",
|
||||
requestUrl, getClientParams().getRetryDelay());
|
||||
backoffAndSleep(retryNumber * getClientParams().getRetryDelay() * 1000);
|
||||
"waiting and repeating request after default delay of {} sec for URL {}",
|
||||
getClientParams().getRetryDelay(), requestUrl);
|
||||
backoffAndSleep(retryNumber * getClientParams().getRetryDelay());
|
||||
}
|
||||
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
|
||||
|
||||
logRequestTime(start);
|
||||
|
||||
urlConn.disconnect();
|
||||
|
||||
return attemptDownload(requestUrl, retryNumber + 1, report);
|
||||
case 422: // UNPROCESSABLE ENTITY
|
||||
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
|
||||
log.warn("waiting and repeating request after 10 sec for URL {}", requestUrl);
|
||||
backoffAndSleep(10000);
|
||||
urlConn.disconnect();
|
||||
logRequestTime(start);
|
||||
try {
|
||||
return getInputStream(urlConn, start);
|
||||
} catch (IOException e) {
|
||||
log
|
||||
.error(
|
||||
"server returned 422 and got IOException accessing the response body from URL {}",
|
||||
requestUrl);
|
||||
log.error("IOException:", e);
|
||||
return attemptDownload(requestUrl, retryNumber + 1, report);
|
||||
}
|
||||
default:
|
||||
log.error("got error {} from URL: {}", urlConn.getResponseCode(), urlConn.getURL());
|
||||
log.error("response message: {}", urlConn.getResponseMessage());
|
||||
report
|
||||
.put(
|
||||
REPORT_PREFIX + urlConn.getResponseCode(),
|
||||
String
|
||||
.format(
|
||||
"%s Error: %s", requestUrl, urlConn.getResponseMessage()));
|
||||
logRequestTime(start);
|
||||
urlConn.disconnect();
|
||||
throw new CollectorException(urlConn.getResponseCode() + " error " + report);
|
||||
}
|
||||
}
|
||||
|
@ -192,13 +226,27 @@ public class HttpConnector2 {
|
|||
}
|
||||
}
|
||||
|
||||
private InputStream getInputStream(HttpURLConnection urlConn, long start) throws IOException {
|
||||
InputStream input = urlConn.getInputStream();
|
||||
responseType = urlConn.getContentType();
|
||||
logRequestTime(start);
|
||||
return input;
|
||||
}
|
||||
|
||||
private static void logRequestTime(long start) {
|
||||
log
|
||||
.info(
|
||||
"request time elapsed: {}sec",
|
||||
TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis() - start));
|
||||
}
|
||||
|
||||
private void logHeaderFields(final HttpURLConnection urlConn) throws IOException {
|
||||
log.debug("StatusCode: {}", urlConn.getResponseMessage());
|
||||
log.info("Response: {} - {}", urlConn.getResponseCode(), urlConn.getResponseMessage());
|
||||
|
||||
for (Map.Entry<String, List<String>> e : urlConn.getHeaderFields().entrySet()) {
|
||||
if (e.getKey() != null) {
|
||||
for (String v : e.getValue()) {
|
||||
log.debug(" key: {} - value: {}", e.getKey(), v);
|
||||
log.info(" key: {} - value: {}", e.getKey(), v);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -218,7 +266,7 @@ public class HttpConnector2 {
|
|||
for (String key : headerMap.keySet()) {
|
||||
if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (!headerMap.get(key).isEmpty())
|
||||
&& NumberUtils.isCreatable(headerMap.get(key).get(0))) {
|
||||
return Integer.parseInt(headerMap.get(key).get(0)) + 10;
|
||||
return Integer.parseInt(headerMap.get(key).get(0));
|
||||
}
|
||||
}
|
||||
return -1;
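
The lookup in the hunk above (find a `Retry-After` header case-insensitively, return its value in seconds, fall back to `-1`) can be sketched without external libraries. `RetryAfter` and `secondsFrom` are hypothetical names. The sketch also swaps `NumberUtils.isCreatable` for a guarded `Integer.parseInt`, which is a substitution made here for self-containment and not what the commit does.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not the project's HttpConnector2.
final class RetryAfter {

    // Returns the Retry-After delay in seconds, or -1 when the header is
    // absent, empty, or not a plain integer (for example an HTTP date).
    static int secondsFrom(Map<String, List<String>> headerMap) {
        for (Map.Entry<String, List<String>> entry : headerMap.entrySet()) {
            String key = entry.getKey(); // may be null: the status line has no name
            List<String> values = entry.getValue();
            if (key == null || !key.equalsIgnoreCase("Retry-After")
                || values == null || values.isEmpty() || values.get(0) == null) {
                continue;
            }
            try {
                return Integer.parseInt(values.get(0).trim());
            } catch (NumberFormatException e) {
                return -1; // present but not a number of seconds
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = new HashMap<>();
        headers.put(null, Collections.singletonList("HTTP/1.1 429 Too Many Requests"));
        headers.put("retry-after", Collections.singletonList("120"));
        System.out.println(secondsFrom(headers)); // prints 120
    }
}
```

Returning the header value unchanged, as the updated line in the hunk does, leaves any extra safety margin to the caller.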
|
||||
|
|
|
@ -63,7 +63,10 @@ public class Vocabulary implements Serializable {
|
|||
}
|
||||
|
||||
public VocabularyTerm getTermBySynonym(final String syn) {
|
||||
return getTerm(synonyms.get(syn.toLowerCase()));
|
||||
return Optional
|
||||
.ofNullable(syn)
|
||||
.map(s -> getTerm(synonyms.get(s.toLowerCase())))
|
||||
.orElse(null);
|
||||
}
|
||||
|
||||
public Qualifier getTermAsQualifier(final String termId) {
|
||||
|
|
|
@ -135,6 +135,24 @@ public class VocabularyGroup implements Serializable {
|
|||
return vocs.get(vocId.toLowerCase()).getSynonymAsQualifier(syn);
|
||||
}
|
||||
|
||||
public Qualifier lookupTermBySynonym(final String vocId, final String syn) {
|
||||
return find(vocId)
|
||||
.map(
|
||||
vocabulary -> Optional
|
||||
.ofNullable(vocabulary.getTerm(syn))
|
||||
.map(
|
||||
term -> OafMapperUtils
|
||||
.qualifier(term.getId(), term.getName(), vocabulary.getId(), vocabulary.getName()))
|
||||
.orElse(
|
||||
Optional
|
||||
.ofNullable(vocabulary.getTermBySynonym(syn))
|
||||
.map(
|
||||
term -> OafMapperUtils
|
||||
.qualifier(term.getId(), term.getName(), vocabulary.getId(), vocabulary.getName()))
|
||||
.orElse(null)))
|
||||
.orElse(null);
|
||||
}
|
||||
|
||||
/**
|
||||
* getSynonymAsQualifierCaseSensitive
|
||||
*
|
||||
|
|
|
@ -119,6 +119,131 @@ public class AuthorMerger {
|
|||
});
|
||||
}
|
||||
|
||||
public static String normalizeFullName(final String fullname) {
|
||||
return nfd(fullname)
|
||||
.toLowerCase()
|
||||
// do not compact the regexes in a single expression, would cause StackOverflowError
// in case of large input strings
|
||||
.replaceAll("(\\W)+", " ")
|
||||
.replaceAll("(\\p{InCombiningDiacriticalMarks})+", " ")
|
||||
.replaceAll("(\\p{Punct})+", " ")
|
||||
.replaceAll("(\\d)+", " ")
|
||||
.replaceAll("(\\n)+", " ")
|
||||
|
||||
.trim();
|
||||
}
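
The effect of the replacement chain in `normalizeFullName` is easier to see on concrete input. The sketch below reproduces the same chain, assuming that the project's `nfd` helper performs Unicode NFD normalization via `java.text.Normalizer` (an assumption, since `nfd` is not shown in this diff). It also uses `Locale.ROOT` for lowercasing to keep the output deterministic, which differs from the original call. `FullNameNormalizer` is a hypothetical name.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration only: mirrors the regex chain shown in the diff.
final class FullNameNormalizer {

    static String normalizeFullName(String fullname) {
        return Normalizer.normalize(fullname, Normalizer.Form.NFD) // assumed behaviour of nfd()
            .toLowerCase(Locale.ROOT)
            // kept as separate passes, as in the original
            .replaceAll("(\\W)+", " ")
            .replaceAll("(\\p{InCombiningDiacriticalMarks})+", " ")
            .replaceAll("(\\p{Punct})+", " ")
            .replaceAll("(\\d)+", " ")
            .replaceAll("(\\n)+", " ")
            .trim();
    }

    public static void main(String[] args) {
        System.out.println(normalizeFullName("Smith, John A."));  // prints smith john a
        System.out.println(normalizeFullName("Müller, J."));      // prints mu ller j
    }
}
```

Under these assumptions the first pass already turns a decomposed diacritic into a space, because `\W` in Java's default mode matches everything outside `[a-zA-Z_0-9]`. A name such as "Müller" therefore comes out as two tokens.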
|
||||
|
||||
private static String authorFieldToBeCompared(Author author) {
|
||||
if (StringUtils.isNotBlank(author.getSurname())) {
|
||||
return author.getSurname();
|
||||
|
||||
}
|
||||
if (StringUtils.isNotBlank(author.getFullname())) {
|
||||
return author.getFullname();
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
/**
|
||||
* This method tries to figure out whether two authors are the same in the context
* of ORCID enrichment
|
||||
*
|
||||
* @param left Author in the OAF entity
|
||||
* @param right Author ORCID
|
||||
* @return based on a heuristic on the names of the authors if they are the same.
|
||||
*/
|
||||
public static boolean checkORCIDSimilarity(final Author left, final Author right) {
|
||||
final Person pl = parse(left);
|
||||
final Person pr = parse(right);
|
||||
|
||||
// If one of them didn't have a surname we verify if they have the fullName not empty
|
||||
// and verify if the normalized version is equal
|
||||
if (!(pl.getSurname() != null && pl.getSurname().stream().anyMatch(StringUtils::isNotBlank) &&
|
||||
pr.getSurname() != null && pr.getSurname().stream().anyMatch(StringUtils::isNotBlank))) {
|
||||
|
||||
if (pl.getFullname() != null && !pl.getFullname().isEmpty() && pr.getFullname() != null
|
||||
&& !pr.getFullname().isEmpty()) {
|
||||
return pl
|
||||
.getFullname()
|
||||
.stream()
|
||||
.anyMatch(
|
||||
fl -> pr.getFullname().stream().anyMatch(fr -> normalize(fl).equalsIgnoreCase(normalize(fr))));
|
||||
} else {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
// The Authors have one surname in common
|
||||
if (pl.getSurname().stream().anyMatch(sl -> pr.getSurname().stream().anyMatch(sr -> sr.equalsIgnoreCase(sl)))) {
|
||||
|
||||
// If one of them has only a surname and is the same we can say that they are the same author
|
||||
if ((pl.getName() == null || pl.getName().stream().allMatch(StringUtils::isBlank)) ||
|
||||
(pr.getName() == null || pr.getName().stream().allMatch(StringUtils::isBlank)))
|
||||
return true;
|
||||
// The authors have the same initials of Name in common
|
||||
if (pl
|
||||
.getName()
|
||||
.stream()
|
||||
.anyMatch(
|
||||
nl -> pr
|
||||
.getName()
|
||||
.stream()
|
||||
.anyMatch(nr -> nr.equalsIgnoreCase(nl))))
|
||||
return true;
|
||||
}
|
||||
|
||||
// Sometimes we noticed that publications have authors written in inverse order (Surname, Name)
// We verify if we have an exact match between name and surname
|
||||
if (pl.getSurname().stream().anyMatch(sl -> pr.getName().stream().anyMatch(nr -> nr.equalsIgnoreCase(sl))) &&
|
||||
pl.getName().stream().anyMatch(nl -> pr.getSurname().stream().anyMatch(sr -> sr.equalsIgnoreCase(nl))))
|
||||
return true;
|
||||
else
|
||||
return false;
|
||||
}
|
||||
//
|
||||
|
||||
/**
|
||||
* Method to enrich ORCID information in one list of authors based on another list
|
||||
*
|
||||
* @param baseAuthor the Author List in the OAF Entity
|
||||
* @param orcidAuthor The list of ORCID Author intersected
|
||||
* @return The Author List of the OAF Entity enriched with the orcid Author
|
||||
*/
|
||||
public static List<Author> enrichOrcid(List<Author> baseAuthor, List<Author> orcidAuthor) {
|
||||
|
||||
if (baseAuthor == null || baseAuthor.isEmpty())
|
||||
return orcidAuthor;
|
||||
|
||||
if (orcidAuthor == null || orcidAuthor.isEmpty())
|
||||
return baseAuthor;
|
||||
|
||||
if (baseAuthor.size() == 1 && orcidAuthor.size() > 10)
|
||||
return baseAuthor;
|
||||
|
||||
final List<Author> oAuthor = new ArrayList<>();
|
||||
oAuthor.addAll(orcidAuthor);
|
||||
|
||||
baseAuthor.forEach(ba -> {
|
||||
Optional<Author> aMatch = oAuthor.stream().filter(oa -> checkORCIDSimilarity(ba, oa)).findFirst();
|
||||
if (aMatch.isPresent()) {
|
||||
final Author sameAuthor = aMatch.get();
|
||||
addPid(ba, sameAuthor.getPid());
|
||||
oAuthor.remove(sameAuthor);
|
||||
}
|
||||
});
|
||||
return baseAuthor;
|
||||
}
|
||||
|
||||
private static void addPid(final Author a, final List<StructuredProperty> pids) {
|
||||
|
||||
if (a.getPid() == null) {
|
||||
a.setPid(new ArrayList<>());
|
||||
}
|
||||
|
||||
a.getPid().addAll(pids);
|
||||
|
||||
}
|
||||
|
||||
public static String pidToComparableString(StructuredProperty pid) {
|
||||
final String classid = pid.getQualifier().getClassid() != null ? pid.getQualifier().getClassid().toLowerCase()
|
||||
: "";
|
||||
|
@ -171,7 +296,7 @@ public class AuthorMerger {
|
|||
}
|
||||
}
|
||||
|
||||
private static String normalize(final String s) {
|
||||
public static String normalize(final String s) {
|
||||
String[] normalized = nfd(s)
|
||||
.toLowerCase()
|
||||
// do not compact the regexes in a single expression, would cause StackOverflowError
|
||||
|
|
|
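The name-matching rules in the AuthorMerger hunk above (shared surname, surname-only authors, shared given name, inverted Surname/Name order) can be sketched in isolation. This is a minimal illustration, not the project's code: the `Parsed` holder and the helper names are hypothetical stand-ins for the parsed author object, and the helpers are null-safe, which the original lambdas are not.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class AuthorNameMatchSketch {

    // Hypothetical stand-in for a parsed author: lists of given names and surnames.
    static final class Parsed {
        final List<String> names;
        final List<String> surnames;

        Parsed(List<String> names, List<String> surnames) {
            this.names = names;
            this.surnames = surnames;
        }
    }

    // True when the two lists share at least one element, ignoring case.
    // Null lists are treated as "nothing in common" (an assumption of this sketch).
    static boolean anyCommon(List<String> a, List<String> b) {
        if (a == null || b == null) {
            return false;
        }
        return a.stream().anyMatch(x -> b.stream().anyMatch(y -> y.equalsIgnoreCase(x)));
    }

    // True when the list is null, empty, or contains only null/whitespace strings.
    static boolean allBlank(List<String> l) {
        return l == null || l.stream().allMatch(s -> s == null || s.trim().isEmpty());
    }

    static boolean sameAuthor(Parsed l, Parsed r) {
        if (anyCommon(l.surnames, r.surnames)) {
            // One side has only a surname: the shared surname is enough.
            if (allBlank(l.names) || allBlank(r.names)) {
                return true;
            }
            // Both sides have given names: require one in common.
            if (anyCommon(l.names, r.names)) {
                return true;
            }
        }
        // Inverted order: surname of one equals name of the other, and vice versa.
        return anyCommon(l.surnames, r.names) && anyCommon(l.names, r.surnames);
    }

    public static void main(String[] args) {
        Parsed a = new Parsed(Arrays.asList("Mario"), Arrays.asList("Rossi"));
        Parsed swapped = new Parsed(Arrays.asList("Rossi"), Arrays.asList("Mario"));
        System.out.println(sameAuthor(a, swapped));
    }
}
```

Note that two authors who share a surname but have different, non-blank given names are not matched unless the inverted-order check also succeeds.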
@ -1,97 +0,0 @@
|
|||
|
||||
package eu.dnetlib.dhp.oa.merge;
|
||||
|
||||
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
|
||||
|
||||
import java.util.Objects;
|
||||
import java.util.Optional;
|
||||
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
import org.apache.spark.SparkConf;
|
||||
import org.apache.spark.api.java.function.FilterFunction;
|
||||
import org.apache.spark.api.java.function.MapFunction;
|
||||
import org.apache.spark.sql.Encoders;
|
||||
import org.apache.spark.sql.SaveMode;
|
||||
import org.apache.spark.sql.SparkSession;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
|
||||
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
|
||||
import eu.dnetlib.dhp.common.HdfsSupport;
|
||||
import eu.dnetlib.dhp.schema.oaf.Oaf;
|
||||
import eu.dnetlib.dhp.schema.oaf.OafEntity;
|
||||
|
||||
public class DispatchEntitiesSparkJob {
|
||||
|
||||
private static final Logger log = LoggerFactory.getLogger(DispatchEntitiesSparkJob.class);
|
||||
|
||||
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
|
||||
|
||||
public static void main(String[] args) throws Exception {
|
||||
|
||||
String jsonConfiguration = IOUtils
|
||||
.toString(
|
||||
Objects
|
||||
.requireNonNull(
|
||||
DispatchEntitiesSparkJob.class
|
||||
.getResourceAsStream(
|
||||
"/eu/dnetlib/dhp/oa/merge/dispatch_entities_parameters.json")));
|
||||
final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);
|
||||
parser.parseArgument(args);
|
||||
|
||||
Boolean isSparkSessionManaged = Optional
|
||||
.ofNullable(parser.get("isSparkSessionManaged"))
|
||||
.map(Boolean::valueOf)
|
||||
.orElse(Boolean.TRUE);
|
||||
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
|
||||
|
||||
String inputPath = parser.get("inputPath");
|
||||
log.info("inputPath: {}", inputPath);
|
||||
|
||||
String outputPath = parser.get("outputPath");
|
||||
log.info("outputPath: {}", outputPath);
|
||||
|
||||
String graphTableClassName = parser.get("graphTableClassName");
|
||||
log.info("graphTableClassName: {}", graphTableClassName);
|
||||
|
||||
@SuppressWarnings("unchecked")
|
||||
Class<? extends OafEntity> entityClazz = (Class<? extends OafEntity>) Class.forName(graphTableClassName);
|
||||
|
||||
SparkConf conf = new SparkConf();
|
||||
runWithSparkSession(
|
||||
conf,
|
||||
isSparkSessionManaged,
|
||||
spark -> {
|
||||
HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration());
|
||||
dispatchEntities(spark, inputPath, entityClazz, outputPath);
|
||||
});
|
||||
}
|
||||
|
||||
private static <T extends Oaf> void dispatchEntities(
|
||||
SparkSession spark,
|
||||
String inputPath,
|
||||
Class<T> clazz,
|
||||
String outputPath) {
|
||||
|
||||
spark
|
||||
.read()
|
||||
.textFile(inputPath)
|
||||
.filter((FilterFunction<String>) s -> isEntityType(s, clazz))
|
||||
.map((MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"), Encoders.STRING())
|
||||
.map(
|
||||
(MapFunction<String, T>) value -> OBJECT_MAPPER.readValue(value, clazz),
|
||||
Encoders.bean(clazz))
|
||||
.write()
|
||||
.mode(SaveMode.Overwrite)
|
||||
.option("compression", "gzip")
|
||||
.json(outputPath);
|
||||
}
|
||||
|
||||
private static <T extends Oaf> boolean isEntityType(final String s, final Class<T> clazz) {
|
||||
return StringUtils.substringBefore(s, "|").equals(clazz.getName());
|
||||
}
|
||||
|
||||
}
|
|
@ -2,50 +2,49 @@
|
|||
package eu.dnetlib.dhp.oa.merge;
|
||||
|
||||
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
|
||||
import static eu.dnetlib.dhp.utils.DHPUtils.toSeq;
|
||||
import static org.apache.spark.sql.functions.col;
|
||||
import static org.apache.spark.sql.functions.when;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.util.List;
|
||||
import java.util.Objects;
|
||||
import java.util.Map;
|
||||
import java.util.Optional;
|
||||
import java.util.concurrent.ExecutionException;
|
||||
import java.util.concurrent.ForkJoinPool;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
import org.apache.spark.SparkConf;
|
||||
import org.apache.spark.api.java.JavaSparkContext;
|
||||
import org.apache.spark.api.java.function.FilterFunction;
|
||||
import org.apache.spark.api.java.function.MapFunction;
|
||||
import org.apache.spark.api.java.function.ReduceFunction;
|
||||
import org.apache.spark.sql.*;
|
||||
import org.apache.spark.sql.expressions.Aggregator;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
import com.fasterxml.jackson.databind.DeserializationFeature;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
import com.jayway.jsonpath.Configuration;
|
||||
import com.jayway.jsonpath.DocumentContext;
|
||||
import com.jayway.jsonpath.JsonPath;
|
||||
import com.jayway.jsonpath.Option;
|
||||
|
||||
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
|
||||
import eu.dnetlib.dhp.common.HdfsSupport;
|
||||
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup;
|
||||
import eu.dnetlib.dhp.schema.common.EntityType;
|
||||
import eu.dnetlib.dhp.schema.common.ModelSupport;
|
||||
import eu.dnetlib.dhp.schema.oaf.*;
|
||||
import eu.dnetlib.dhp.schema.oaf.OafEntity;
|
||||
import eu.dnetlib.dhp.schema.oaf.utils.GraphCleaningFunctions;
|
||||
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
|
||||
import eu.dnetlib.dhp.utils.ISLookupClientFactory;
|
||||
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
|
||||
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
|
||||
import scala.Tuple2;
|
||||
|
||||
/**
|
||||
* Groups the graph content by entity identifier to ensure ID uniqueness
|
||||
*/
|
||||
public class GroupEntitiesSparkJob {
|
||||
|
||||
private static final Logger log = LoggerFactory.getLogger(GroupEntitiesSparkJob.class);
|
||||
|
||||
private static final String ID_JPATH = "$.id";
|
||||
private static final Encoder<OafEntity> OAFENTITY_KRYO_ENC = Encoders.kryo(OafEntity.class);
|
||||
|
||||
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
|
||||
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
|
||||
private ArgumentApplicationParser parser;
|
||||
|
||||
public GroupEntitiesSparkJob(ArgumentApplicationParser parser) {
|
||||
this.parser = parser;
|
||||
}
|
||||
|
||||
public static void main(String[] args) throws Exception {
|
||||
|
||||
|
@ -63,141 +62,133 @@ public class GroupEntitiesSparkJob {
|
|||
.orElse(Boolean.TRUE);
|
||||
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
|
||||
|
||||
final String isLookupUrl = parser.get("isLookupUrl");
|
||||
log.info("isLookupUrl: {}", isLookupUrl);
|
||||
|
||||
final ISLookUpService isLookupService = ISLookupClientFactory.getLookUpService(isLookupUrl);
|
||||
|
||||
new GroupEntitiesSparkJob(parser).run(isSparkSessionManaged, isLookupService);
|
||||
}
|
||||
|
||||
public void run(Boolean isSparkSessionManaged, ISLookUpService isLookUpService)
|
||||
throws ISLookUpException {
|
||||
|
||||
String graphInputPath = parser.get("graphInputPath");
|
||||
log.info("graphInputPath: {}", graphInputPath);
|
||||
|
||||
String checkpointPath = parser.get("checkpointPath");
|
||||
log.info("checkpointPath: {}", checkpointPath);
|
||||
|
||||
String outputPath = parser.get("outputPath");
|
||||
log.info("outputPath: {}", outputPath);
|
||||
|
||||
boolean filterInvisible = Boolean.parseBoolean(parser.get("filterInvisible"));
|
||||
log.info("filterInvisible: {}", filterInvisible);
|
||||
|
||||
SparkConf conf = new SparkConf();
|
||||
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
|
||||
conf.registerKryoClasses(ModelSupport.getOafModelClasses());
|
||||
|
||||
final VocabularyGroup vocs = VocabularyGroup.loadVocsFromIS(isLookUpService);
|
||||
|
||||
runWithSparkSession(
|
||||
conf,
|
||||
isSparkSessionManaged,
|
||||
spark -> {
|
||||
HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration());
|
||||
groupEntities(spark, graphInputPath, outputPath);
|
||||
HdfsSupport.remove(checkpointPath, spark.sparkContext().hadoopConfiguration());
|
||||
groupEntities(spark, graphInputPath, checkpointPath, outputPath, filterInvisible, vocs);
|
||||
});
|
||||
}
|
||||
|
||||
private static void groupEntities(
|
||||
SparkSession spark,
|
||||
String inputPath,
|
||||
String outputPath) {
|
||||
String checkpointPath,
|
||||
String outputPath,
|
||||
boolean filterInvisible, VocabularyGroup vocs) {
|
||||
|
||||
Dataset<OafEntity> allEntities = spark.emptyDataset(OAFENTITY_KRYO_ENC);
|
||||
|
||||
for (Map.Entry<EntityType, Class> e : ModelSupport.entityTypes.entrySet()) {
|
||||
String entity = e.getKey().name();
|
||||
Class<? extends OafEntity> entityClass = e.getValue();
|
||||
String entityInputPath = inputPath + "/" + entity;
|
||||
|
||||
if (!HdfsSupport.exists(entityInputPath, spark.sparkContext().hadoopConfiguration())) {
|
||||
continue;
|
||||
}
|
||||
|
||||
allEntities = allEntities
|
||||
.union(
|
||||
((Dataset<OafEntity>) spark
|
||||
.read()
|
||||
.schema(Encoders.bean(entityClass).schema())
|
||||
.json(entityInputPath)
|
||||
.filter("length(id) > 0")
|
||||
.as(Encoders.bean(entityClass)))
|
||||
.map((MapFunction<OafEntity, OafEntity>) r -> r, OAFENTITY_KRYO_ENC));
|
||||
}
|
||||
|
||||
Dataset<?> groupedEntities = allEntities
|
||||
.map(
|
||||
(MapFunction<OafEntity, OafEntity>) entity -> GraphCleaningFunctions
|
||||
.applyCoarVocabularies(entity, vocs),
|
||||
OAFENTITY_KRYO_ENC)
|
||||
.groupByKey((MapFunction<OafEntity, String>) OafEntity::getId, Encoders.STRING())
|
||||
.reduceGroups((ReduceFunction<OafEntity>) OafMapperUtils::mergeEntities)
|
||||
.map(
|
||||
(MapFunction<Tuple2<String, OafEntity>, Tuple2<String, OafEntity>>) t -> new Tuple2<>(
|
||||
t._2().getClass().getName(), t._2()),
|
||||
Encoders.tuple(Encoders.STRING(), OAFENTITY_KRYO_ENC));
|
||||
|
||||
// pivot on "_1" (classname of the entity)
|
||||
// create columns containing only entities of the same class
|
||||
for (Map.Entry<EntityType, Class> e : ModelSupport.entityTypes.entrySet()) {
|
||||
String entity = e.getKey().name();
|
||||
Class<? extends OafEntity> entityClass = e.getValue();
|
||||
|
||||
groupedEntities = groupedEntities
|
||||
.withColumn(
|
||||
entity,
|
||||
when(col("_1").equalTo(entityClass.getName()), col("_2")));
|
||||
}
|
||||
|
||||
groupedEntities
|
||||
.drop("_1", "_2")
|
||||
.write()
|
||||
.mode(SaveMode.Overwrite)
|
||||
.option("compression", "gzip")
|
||||
.save(checkpointPath);
|
||||
|
||||
ForkJoinPool parPool = new ForkJoinPool(ModelSupport.entityTypes.size());
|
||||
|
||||
ModelSupport.entityTypes
|
||||
.entrySet()
|
||||
.stream()
|
||||
.map(e -> parPool.submit(() -> {
|
||||
String entity = e.getKey().name();
|
||||
Class<? extends OafEntity> entityClass = e.getValue();
|
||||
|
||||
final TypedColumn<OafEntity, OafEntity> aggregator = new GroupingAggregator().toColumn();
|
||||
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
|
||||
spark
|
||||
.read()
|
||||
.textFile(toSeq(listEntityPaths(inputPath, sc)))
|
||||
.map((MapFunction<String, OafEntity>) GroupEntitiesSparkJob::parseOaf, Encoders.kryo(OafEntity.class))
|
||||
.filter((FilterFunction<OafEntity>) e -> StringUtils.isNotBlank(ModelSupport.idFn().apply(e)))
|
||||
.groupByKey((MapFunction<OafEntity, String>) oaf -> ModelSupport.idFn().apply(oaf), Encoders.STRING())
|
||||
.agg(aggregator)
|
||||
.map(
|
||||
(MapFunction<Tuple2<String, OafEntity>, String>) t -> t._2().getClass().getName() +
|
||||
"|" + OBJECT_MAPPER.writeValueAsString(t._2()),
|
||||
Encoders.STRING())
|
||||
.load(checkpointPath)
|
||||
.select(col(entity).as("value"))
|
||||
.filter("value IS NOT NULL")
|
||||
.as(OAFENTITY_KRYO_ENC)
|
||||
.map((MapFunction<OafEntity, OafEntity>) r -> r, (Encoder<OafEntity>) Encoders.bean(entityClass))
|
||||
.filter(filterInvisible ? "dataInfo.invisible != TRUE" : "TRUE")
|
||||
.write()
|
||||
.option("compression", "gzip")
|
||||
.mode(SaveMode.Overwrite)
|
||||
.text(outputPath);
|
||||
}
|
||||
|
||||
public static class GroupingAggregator extends Aggregator<OafEntity, OafEntity, OafEntity> {
|
||||
|
||||
@Override
|
||||
public OafEntity zero() {
|
||||
return null;
|
||||
}
|
||||
|
||||
@Override
|
||||
public OafEntity reduce(OafEntity b, OafEntity a) {
|
||||
return mergeAndGet(b, a);
|
||||
}
|
||||
|
||||
private OafEntity mergeAndGet(OafEntity b, OafEntity a) {
|
||||
if (Objects.nonNull(a) && Objects.nonNull(b)) {
|
||||
return OafMapperUtils.mergeEntities(b, a);
|
||||
}
|
||||
return Objects.isNull(a) ? b : a;
|
||||
}
|
||||
|
||||
@Override
|
||||
public OafEntity merge(OafEntity b, OafEntity a) {
|
||||
return mergeAndGet(b, a);
|
||||
}
|
||||
|
||||
@Override
|
||||
public OafEntity finish(OafEntity j) {
|
||||
return j;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Encoder<OafEntity> bufferEncoder() {
|
||||
return Encoders.kryo(OafEntity.class);
|
||||
}
|
||||
|
||||
@Override
|
||||
public Encoder<OafEntity> outputEncoder() {
|
||||
return Encoders.kryo(OafEntity.class);
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
private static OafEntity parseOaf(String s) {
|
||||
|
||||
DocumentContext dc = JsonPath
|
||||
.parse(s, Configuration.defaultConfiguration().addOptions(Option.SUPPRESS_EXCEPTIONS));
|
||||
final String id = dc.read(ID_JPATH);
|
||||
if (StringUtils.isNotBlank(id)) {
|
||||
|
||||
String prefix = StringUtils.substringBefore(id, "|");
|
||||
switch (prefix) {
|
||||
case "10":
|
||||
return parse(s, Datasource.class);
|
||||
case "20":
|
||||
return parse(s, Organization.class);
|
||||
case "40":
|
||||
return parse(s, Project.class);
|
||||
case "50":
|
||||
String resultType = dc.read("$.resulttype.classid");
|
||||
switch (resultType) {
|
||||
case "publication":
|
||||
return parse(s, Publication.class);
|
||||
case "dataset":
|
||||
return parse(s, eu.dnetlib.dhp.schema.oaf.Dataset.class);
|
||||
case "software":
|
||||
return parse(s, Software.class);
|
||||
case "other":
|
||||
return parse(s, OtherResearchProduct.class);
|
||||
default:
|
||||
throw new IllegalArgumentException(String.format("invalid resultType: '%s'", resultType));
|
||||
}
|
||||
default:
|
||||
throw new IllegalArgumentException(String.format("invalid id prefix: '%s'", prefix));
|
||||
}
|
||||
} else {
|
||||
throw new IllegalArgumentException(String.format("invalid oaf: '%s'", s));
|
||||
}
|
||||
}
|
||||
|
||||
private static <T extends OafEntity> OafEntity parse(String s, Class<T> clazz) {
|
||||
.option("compression", "gzip")
|
||||
.json(outputPath + "/" + entity);
|
||||
}))
|
||||
.collect(Collectors.toList())
|
||||
.forEach(t -> {
|
||||
try {
|
||||
return OBJECT_MAPPER.readValue(s, clazz);
|
||||
} catch (IOException e) {
|
||||
throw new IllegalArgumentException(e);
|
||||
t.get();
|
||||
} catch (InterruptedException | ExecutionException e) {
|
||||
throw new RuntimeException(e);
|
||||
}
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
private static List<String> listEntityPaths(String inputPath, JavaSparkContext sc) {
|
||||
return HdfsSupport
|
||||
.listFiles(inputPath, sc.hadoopConfiguration())
|
||||
.stream()
|
||||
.filter(f -> !f.toLowerCase().contains("relation"))
|
||||
.collect(Collectors.toList());
|
||||
}
|
||||
|
||||
}
|
||||
|
|
|
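The GroupEntitiesSparkJob hunk above submits one write task per entity type to a `ForkJoinPool`, collects the resulting futures, and then blocks on each of them. The sketch below shows that submit-collect-join pattern using only the JDK. The entity names and the `"written:"` result string are illustrative placeholders for the Spark write, and the explicit `shutdown()` is an addition of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.stream.Collectors;

final class ParallelPerEntitySketch {

    // Runs one job per entity name in parallel and returns the results in submission order.
    static List<String> runAll(List<String> entityNames) {
        // One worker per entity, as in the diff; at least one to keep the constructor happy.
        ForkJoinPool pool = new ForkJoinPool(Math.max(1, entityNames.size()));
        try {
            // Submit everything first so the jobs can run concurrently.
            List<ForkJoinTask<String>> tasks = entityNames
                .stream()
                .map(name -> {
                    Callable<String> job = () -> "written:" + name; // placeholder for the per-entity write
                    return pool.submit(job);
                })
                .collect(Collectors.toList());

            // Then wait for each task, turning checked failures into unchecked ones.
            List<String> results = new ArrayList<>();
            for (ForkJoinTask<String> t : tasks) {
                try {
                    results.add(t.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(Arrays.asList("publication", "dataset", "software")));
    }
}
```

Collecting the futures into a list before calling `get()` matters: calling `get()` inside the same stream stage that submits the task would serialize the jobs.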
@ -0,0 +1,77 @@
|
|||
|
||||
package eu.dnetlib.dhp.oozie;
|
||||
|
||||
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkHiveSession;
|
||||
|
||||
import java.net.URL;
|
||||
import java.nio.charset.StandardCharsets;
|
||||
import java.util.HashMap;
|
||||
import java.util.Map;
|
||||
import java.util.Optional;
|
||||
|
||||
import org.apache.commons.lang3.time.DurationFormatUtils;
|
||||
import org.apache.commons.text.StringSubstitutor;
|
||||
import org.apache.spark.SparkConf;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
import com.google.common.io.Resources;
|
||||
|
||||
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
|
||||
|
||||
public class RunSQLSparkJob {
|
||||
private static final Logger log = LoggerFactory.getLogger(RunSQLSparkJob.class);
|
||||
|
||||
private final ArgumentApplicationParser parser;
|
||||
|
||||
public RunSQLSparkJob(ArgumentApplicationParser parser) {
|
||||
this.parser = parser;
|
||||
}
|
||||
|
||||
public static void main(String[] args) throws Exception {
|
||||
|
||||
Map<String, String> params = new HashMap<>();
|
||||
for (int i = 0; i < args.length - 1; i++) {
|
||||
if (args[i].startsWith("--")) {
|
||||
params.put(args[i].substring(2), args[++i]);
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* String jsonConfiguration = IOUtils .toString( Objects .requireNonNull( RunSQLSparkJob.class
|
||||
* .getResourceAsStream( "/eu/dnetlib/dhp/oozie/run_sql_parameters.json"))); final ArgumentApplicationParser
|
||||
* parser = new ArgumentApplicationParser(jsonConfiguration); parser.parseArgument(args);
|
||||
*/
|
||||
|
||||
Boolean isSparkSessionManaged = Optional
|
||||
.ofNullable(params.get("isSparkSessionManaged"))
|
||||
.map(Boolean::valueOf)
|
||||
.orElse(Boolean.TRUE);
|
||||
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
|
||||
|
||||
URL url = com.google.common.io.Resources.getResource(params.get("sql"));
|
||||
String raw_sql = Resources.toString(url, StandardCharsets.UTF_8);
|
||||
|
||||
String sql = StringSubstitutor.replace(raw_sql, params);
|
||||
log.info("sql: {}", sql);
|
||||
|
||||
SparkConf conf = new SparkConf();
|
||||
conf.set("hive.metastore.uris", params.get("hiveMetastoreUris"));
|
||||
|
||||
runWithSparkHiveSession(
|
||||
conf,
|
||||
isSparkSessionManaged,
|
||||
spark -> {
|
||||
for (String statement : sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*")) {
|
||||
log.info("executing: {}", statement);
|
||||
long startTime = System.currentTimeMillis();
|
||||
spark.sql(statement).show();
|
||||
log
|
||||
.info(
|
||||
"executed in {}",
|
||||
DurationFormatUtils.formatDuration(System.currentTimeMillis() - startTime, "HH:mm:ss.S"));
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
}
|
|
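RunSQLSparkJob, as added above, reads `--key value` pairs directly from the command line and splits the SQL script into statements on a `;` followed by an `/* EOS */` marker. The following is a minimal sketch of those two steps without Spark or commons-text; the class and method names are hypothetical, and the sample SQL is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class RunSqlParsingSketch {

    // Same delimiter as in the diff: a semicolon, optional whitespace, then an /* EOS */ comment.
    static final String STATEMENT_DELIMITER = ";\\s*/\\*\\s*EOS\\s*\\*/\\s*";

    // Collects "--key value" pairs; a token not starting with "--" is skipped.
    static Map<String, String> parseArgs(String[] args) {
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].startsWith("--")) {
                params.put(args[i].substring(2), args[++i]);
            }
        }
        return params;
    }

    // Splits a script into statements; a plain ';' without the EOS marker does not split.
    static String[] splitStatements(String sql) {
        return sql.split(STATEMENT_DELIMITER);
    }

    public static void main(String[] args) {
        String script = "CREATE TABLE a AS SELECT 1; /* EOS */\nDROP TABLE b; /*EOS*/";
        for (String statement : splitStatements(script)) {
            System.out.println("executing: " + statement);
        }
    }
}
```

Because the delimiter requires the marker, a semicolon inside a string literal or a multi-part statement is left untouched, at the cost of requiring every statement in the script to end with `; /* EOS */`.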
@ -6,14 +6,16 @@ import java.util.regex.Pattern;
|
|||
|
||||
public class FundRefCleaningRule {
|
||||
|
||||
public static String clean(final String fundrefId) {
|
||||
public static final Pattern PATTERN = Pattern.compile("\\d+");
|
||||
|
||||
String s = fundrefId
|
||||
public static String clean(final String fundRefId) {
|
||||
|
||||
String s = fundRefId
|
||||
.toLowerCase()
|
||||
.replaceAll("\\s", "");
|
||||
|
||||
Matcher m = Pattern.compile("\\d+").matcher(s);
|
||||
if (m.matches()) {
|
||||
Matcher m = PATTERN.matcher(s);
|
||||
if (m.find()) {
|
||||
return m.group();
|
||||
} else {
|
||||
return "";
|
||||
|
|
|
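The FundRefCleaningRule hunk above replaces `Matcher.matches()` with `Matcher.find()` and hoists the pattern into a constant. The difference is that `matches()` succeeds only if the whole input is digits, while `find()` returns the first run of digits anywhere in the input. A small side-by-side sketch follows; the class and method names are hypothetical, and the sample identifiers are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FundRefMatchVsFindSketch {

    // Compiled once, as in the new version of the rule.
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Same pre-processing as in the diff: lowercase and strip all whitespace.
    static String normalize(String id) {
        return id.toLowerCase().replaceAll("\\s", "");
    }

    // Old behaviour: the entire normalized string must consist of digits.
    static String cleanWithMatches(String id) {
        Matcher m = DIGITS.matcher(normalize(id));
        return m.matches() ? m.group() : "";
    }

    // New behaviour: return the first run of digits found anywhere in the string.
    static String cleanWithFind(String id) {
        Matcher m = DIGITS.matcher(normalize(id));
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String id = "FundRef 501100000780";
        System.out.println("matches: '" + cleanWithMatches(id) + "'");
        System.out.println("find:    '" + cleanWithFind(id) + "'");
    }
}
```

One consequence worth keeping in mind: since `find()` stops at the first digit run, an input that carries a DOI-style prefix such as `10.13039/501100000780` yields `10` rather than the trailing identifier.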
@ -1,8 +1,12 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static eu.dnetlib.dhp.schema.common.ModelConstants.*;
|
||||
import static eu.dnetlib.dhp.schema.common.ModelConstants.OPENAIRE_META_RESOURCE_TYPE;
|
||||
import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.getProvenance;
|
||||
|
||||
import java.net.MalformedURLException;
|
||||
import java.net.URL;
|
||||
import java.time.LocalDate;
|
||||
import java.time.ZoneId;
|
||||
import java.time.format.DateTimeFormatter;
|
||||
|
@ -13,16 +17,13 @@ import java.util.stream.Collectors;
|
|||
import java.util.stream.Stream;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
import org.apache.spark.api.java.function.MapFunction;
|
||||
import org.apache.spark.sql.Encoders;
|
||||
|
||||
import com.fasterxml.jackson.core.JsonProcessingException;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
import com.github.sisyphsu.dateparser.DateParserUtils;
|
||||
import com.google.common.collect.Lists;
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup;
|
||||
import eu.dnetlib.dhp.common.vocabulary.VocabularyTerm;
|
||||
import eu.dnetlib.dhp.schema.common.ModelConstants;
|
||||
import eu.dnetlib.dhp.schema.common.ModelSupport;
|
||||
import eu.dnetlib.dhp.schema.oaf.*;
|
||||
|
@ -30,6 +31,10 @@ import me.xuender.unidecode.Unidecode;
|
|||
|
||||
public class GraphCleaningFunctions extends CleaningFunctions {
|
||||
|
||||
public static final String DNET_PUBLISHERS = "dnet:publishers";
|
||||
|
||||
public static final String DNET_LICENSES = "dnet:licenses";
|
||||
|
||||
public static final String ORCID_CLEANING_REGEX = ".*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9x]{4})";
|
||||
public static final int ORCID_LEN = 19;
|
||||
public static final String CLEANING_REGEX = "(?:\\n|\\r|\\t)";
|
||||
|
@ -39,6 +44,68 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
public static final String TITLE_FILTER_REGEX = String.format("(%s)|\\W|\\d", TITLE_TEST);
|
||||
|
||||
public static final int TITLE_FILTER_RESIDUAL_LENGTH = 5;
|
||||
private static final String NAME_CLEANING_REGEX = "[\\r\\n\\t\\s]+";
|
||||
|
||||
private static final Set<String> INVALID_AUTHOR_NAMES = new HashSet<>();
|
||||
|
||||
private static final Set<String> INVALID_URLS = new HashSet<>();
|
||||
|
||||
private static final Set<String> INVALID_URL_HOSTS = new HashSet<>();
|
||||
|
||||
private static final HashSet<String> PEER_REVIEWED_TYPES = new HashSet<>();
|
||||
|
||||
static {
|
||||
PEER_REVIEWED_TYPES.add("Article");
|
||||
PEER_REVIEWED_TYPES.add("Part of book or chapter of book");
|
||||
PEER_REVIEWED_TYPES.add("Book");
|
||||
PEER_REVIEWED_TYPES.add("Doctoral thesis");
|
||||
PEER_REVIEWED_TYPES.add("Master thesis");
|
||||
PEER_REVIEWED_TYPES.add("Data Paper");
|
||||
PEER_REVIEWED_TYPES.add("Thesis");
|
||||
PEER_REVIEWED_TYPES.add("Bachelor thesis");
|
||||
PEER_REVIEWED_TYPES.add("Conference object");
|
||||
|
||||
INVALID_AUTHOR_NAMES.add("(:null)");
|
||||
INVALID_AUTHOR_NAMES.add("(:unap)");
|
||||
INVALID_AUTHOR_NAMES.add("(:tba)");
|
||||
INVALID_AUTHOR_NAMES.add("(:unas)");
|
||||
INVALID_AUTHOR_NAMES.add("(:unav)");
|
||||
INVALID_AUTHOR_NAMES.add("(:unkn)");
|
||||
INVALID_AUTHOR_NAMES.add("(:unkn) unknown");
|
||||
INVALID_AUTHOR_NAMES.add(":none");
|
||||
INVALID_AUTHOR_NAMES.add(":null");
|
||||
INVALID_AUTHOR_NAMES.add(":unas");
|
||||
INVALID_AUTHOR_NAMES.add(":unav");
|
||||
INVALID_AUTHOR_NAMES.add(":unkn");
|
||||
INVALID_AUTHOR_NAMES.add("[autor desconocido]");
|
||||
INVALID_AUTHOR_NAMES.add("[s. n.]");
|
||||
INVALID_AUTHOR_NAMES.add("[s.n]");
|
||||
INVALID_AUTHOR_NAMES.add("[unknown]");
|
||||
INVALID_AUTHOR_NAMES.add("anonymous");
|
||||
INVALID_AUTHOR_NAMES.add("n.n.");
|
||||
INVALID_AUTHOR_NAMES.add("nn");
|
||||
INVALID_AUTHOR_NAMES.add("no name supplied");
|
||||
INVALID_AUTHOR_NAMES.add("none");
|
||||
INVALID_AUTHOR_NAMES.add("none available");
|
||||
INVALID_AUTHOR_NAMES.add("not available not available");
|
||||
INVALID_AUTHOR_NAMES.add("null &na;");
|
||||
INVALID_AUTHOR_NAMES.add("null anonymous");
|
||||
INVALID_AUTHOR_NAMES.add("unbekannt");
|
||||
INVALID_AUTHOR_NAMES.add("unknown");
|
||||
|
||||
INVALID_URL_HOSTS.add("creativecommons.org");
|
||||
INVALID_URL_HOSTS.add("www.academia.edu");
|
||||
INVALID_URL_HOSTS.add("academia.edu");
|
||||
INVALID_URL_HOSTS.add("researchgate.net");
|
||||
INVALID_URL_HOSTS.add("www.researchgate.net");
|
||||
|
||||
INVALID_URLS.add("http://repo.scoap3.org/api");
|
||||
INVALID_URLS.add("http://ora.ox.ac.uk/objects/uuid:");
|
||||
INVALID_URLS.add("http://ntur.lib.ntu.edu.tw/news/agent_contract.pdf");
|
||||
INVALID_URLS.add("https://media.springer.com/full/springer-instructions-for-authors-assets/pdf/SN_BPF_EN.pdf");
|
||||
INVALID_URLS.add("http://www.tobaccoinduceddiseases.org/dl/61aad426c96519bea4040a374c6a6110/");
|
||||
INVALID_URLS.add("https://www.bilboard.nl/verenigingsbladen/bestuurskundige-berichten");
|
||||
}
|
||||
|
||||
public static <T extends Oaf> T cleanContext(T value, String contextId, String verifyParam) {
|
||||
if (ModelSupport.isSubClass(value, Result.class)) {
|
||||
|
@ -228,7 +295,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
}
|
||||
|
||||
public static <T extends Oaf> boolean filter(T value) {
|
||||
if (Boolean.TRUE
|
||||
if (!(value instanceof Relation) && (Boolean.TRUE
|
||||
.equals(
|
||||
Optional
|
||||
.ofNullable(value)
|
||||
|
@ -239,15 +306,16 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
d -> Optional
|
||||
.ofNullable(d.getInvisible())
|
||||
.orElse(true))
|
||||
.orElse(true))
|
||||
.orElse(true))) {
|
||||
.orElse(false))
|
||||
.orElse(true)))) {
|
||||
return true;
|
||||
}
|
||||
|
||||
if (value instanceof Datasource) {
|
||||
// nothing to evaluate here
|
||||
} else if (value instanceof Project) {
|
||||
// nothing to evaluate here
|
||||
final Project p = (Project) value;
|
||||
return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
|
||||
} else if (value instanceof Organization) {
|
||||
// nothing to evaluate here
|
||||
} else if (value instanceof Relation) {
|
||||
|
@ -275,6 +343,12 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
|
||||
public static <T extends Oaf> T cleanup(T value, VocabularyGroup vocs) {
|
||||
|
||||
if (Objects.isNull(value.getDataInfo())) {
|
||||
final DataInfo d = new DataInfo();
|
||||
d.setDeletedbyinference(false);
|
||||
value.setDataInfo(d);
|
||||
}
|
||||
|
||||
if (value instanceof OafEntity) {
|
||||
|
||||
OafEntity e = (OafEntity) value;
|
||||
|
@ -294,6 +368,17 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
} else if (value instanceof Result) {
|
||||
Result r = (Result) value;
|
||||
|
||||
if (Objects.isNull(r.getContext())) {
|
||||
r.setContext(new ArrayList<>());
|
||||
}
|
||||
|
||||
if (Objects.nonNull(r.getFulltext())
|
||||
&& (ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
|
||||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
|
||||
r.setFulltext(null);
|
||||
|
||||
}
|
||||
|
||||
if (Objects.nonNull(r.getDateofacceptance())) {
|
||||
Optional<String> date = cleanDateField(r.getDateofacceptance());
|
||||
if (date.isPresent()) {
|
||||
|
@ -318,8 +403,26 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
|
||||
.collect(Collectors.toList()));
|
||||
}
|
||||
if (Objects.nonNull(r.getPublisher()) && StringUtils.isBlank(r.getPublisher().getValue())) {
|
||||
if (Objects.nonNull(r.getPublisher())) {
|
||||
if (StringUtils.isBlank(r.getPublisher().getValue())) {
|
||||
r.setPublisher(null);
|
||||
} else {
|
||||
r
|
||||
.getPublisher()
|
||||
.setValue(
|
||||
r
|
||||
.getPublisher()
|
||||
.getValue()
|
||||
.replaceAll(NAME_CLEANING_REGEX, " "));
|
||||
|
||||
if (vocs.vocabularyExists(DNET_PUBLISHERS)) {
|
||||
vocs
|
||||
.find(DNET_PUBLISHERS)
|
||||
.map(voc -> voc.getTermBySynonym(r.getPublisher().getValue()))
|
||||
.map(VocabularyTerm::getName)
|
||||
.ifPresent(publisher -> r.getPublisher().setValue(publisher));
|
||||
}
|
||||
}
|
||||
}
|
||||
if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) {
|
||||
r
|
||||
|
@ -478,6 +581,43 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
if (Objects.isNull(i.getRefereed()) || StringUtils.isBlank(i.getRefereed().getClassid())) {
|
||||
i.setRefereed(qualifier("0000", "Unknown", ModelConstants.DNET_REVIEW_LEVELS));
|
||||
}
|
||||
|
||||
if (Objects.nonNull(i.getLicense()) && Objects.nonNull(i.getLicense().getValue())) {
|
||||
vocs
|
||||
.find(DNET_LICENSES)
|
||||
.map(voc -> voc.getTermBySynonym(i.getLicense().getValue()))
|
||||
.map(VocabularyTerm::getId)
|
||||
.ifPresent(license -> i.getLicense().setValue(license));
|
||||
}
|
||||
|
||||
// from the script from Dimitris
|
||||
if ("0000".equals(i.getRefereed().getClassid())) {
|
||||
final boolean isFromCrossref = Optional
|
||||
.ofNullable(i.getCollectedfrom())
|
||||
.map(KeyValue::getKey)
|
||||
.map(id -> id.equals(ModelConstants.CROSSREF_ID))
|
||||
.orElse(false);
|
||||
final boolean hasDoi = Optional
|
||||
.ofNullable(i.getPid())
|
||||
.map(
|
||||
pid -> pid
|
||||
.stream()
|
||||
.anyMatch(
|
||||
p -> PidType.doi.toString().equals(p.getQualifier().getClassid())))
|
||||
.orElse(false);
|
||||
final boolean isPeerReviewedType = PEER_REVIEWED_TYPES
|
||||
.contains(i.getInstancetype().getClassname());
|
||||
final boolean noOtherLitType = r
|
||||
.getInstance()
|
||||
.stream()
|
||||
.noneMatch(ii -> "Other literature type".equals(ii.getInstancetype().getClassname()));
|
||||
if (isFromCrossref && hasDoi && isPeerReviewedType && noOtherLitType) {
|
||||
i.setRefereed(qualifier("0001", "peerReviewed", ModelConstants.DNET_REVIEW_LEVELS));
|
||||
} else {
|
||||
i.setRefereed(qualifier("0002", "nonPeerReviewed", ModelConstants.DNET_REVIEW_LEVELS));
|
||||
}
|
||||
}
|
||||
|
||||
if (Objects.nonNull(i.getDateofacceptance())) {
|
||||
Optional<String> date = cleanDateField(i.getDateofacceptance());
|
||||
if (date.isPresent()) {
|
||||
|
@ -486,6 +626,20 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
i.setDateofacceptance(null);
|
||||
}
|
||||
}
|
||||
if (StringUtils.isNotBlank(i.getFulltext()) &&
|
||||
(ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
|
||||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
|
||||
i.setFulltext(null);
|
||||
}
|
||||
if (Objects.nonNull(i.getUrl())) {
|
||||
i
|
||||
.setUrl(
|
||||
i
|
||||
.getUrl()
|
||||
.stream()
|
||||
.filter(GraphCleaningFunctions::urlFilter)
|
||||
.collect(Collectors.toList()));
|
||||
}
|
||||
}
|
||||
}
|
||||
if (Objects.isNull(r.getBestaccessright())
|
||||
|
@ -508,8 +662,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
.getAuthor()
|
||||
.stream()
|
||||
.filter(Objects::nonNull)
|
||||
.filter(a -> StringUtils.isNotBlank(a.getFullname()))
|
||||
.filter(a -> StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")))
|
||||
.filter(GraphCleaningFunctions::isValidAuthorName)
|
||||
.map(GraphCleaningFunctions::cleanupAuthor)
|
||||
.collect(Collectors.toList()));
|
||||
|
||||
boolean nullRank = r
|
||||
|
@ -535,6 +689,9 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
.filter(Objects::nonNull)
|
||||
.filter(p -> Objects.nonNull(p.getQualifier()))
|
||||
.filter(p -> StringUtils.isNotBlank(p.getValue()))
|
||||
.filter(
|
||||
p -> StringUtils
|
||||
.contains(StringUtils.lowerCase(p.getQualifier().getClassid()), ORCID))
|
||||
.map(p -> {
|
||||
// hack to distinguish orcid from orcid_pending
|
||||
String pidProvenance = getProvenance(p.getDataInfo());
|
||||
|
@ -544,7 +701,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
.toLowerCase()
|
||||
.contains(ModelConstants.ORCID)) {
|
||||
if (pidProvenance
|
||||
.equals(ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY)) {
|
||||
.equals(ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY) ||
|
||||
pidProvenance.equals("ORCID_ENRICHMENT")) {
|
||||
p.getQualifier().setClassid(ModelConstants.ORCID);
|
||||
} else {
|
||||
p.getQualifier().setClassid(ModelConstants.ORCID_PENDING);
|
||||
|
@ -604,6 +762,35 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
return value;
|
||||
}
|
||||
|
||||
private static Author cleanupAuthor(Author author) {
|
||||
if (StringUtils.isNotBlank(author.getFullname())) {
|
||||
author
|
||||
.setFullname(
|
||||
author
|
||||
.getFullname()
|
||||
.replaceAll(NAME_CLEANING_REGEX, " ")
|
||||
.replace("\"", "\\\""));
|
||||
}
|
||||
if (StringUtils.isNotBlank(author.getName())) {
|
||||
author
|
||||
.setName(
|
||||
author
|
||||
.getName()
|
||||
.replaceAll(NAME_CLEANING_REGEX, " ")
|
||||
.replace("\"", "\\\""));
|
||||
}
|
||||
if (StringUtils.isNotBlank(author.getSurname())) {
|
||||
author
|
||||
.setSurname(
|
||||
author
|
||||
.getSurname()
|
||||
.replaceAll(NAME_CLEANING_REGEX, " ")
|
||||
.replace("\"", "\\\""));
|
||||
}
|
||||
|
||||
return author;
|
||||
}
|
||||
|
||||
private static Optional<String> cleanDateField(Field<String> dateofacceptance) {
|
||||
return Optional
|
||||
.ofNullable(dateofacceptance)
|
||||
|
@ -637,14 +824,32 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
// HELPERS
|
||||
|
||||
private static boolean isValidAuthorName(Author a) {
|
||||
return !Stream
|
||||
return StringUtils.isNotBlank(a.getFullname()) &&
|
||||
StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")) &&
|
||||
!INVALID_AUTHOR_NAMES.contains(StringUtils.lowerCase(a.getFullname()).trim()) &&
|
||||
!Stream
|
||||
.of(a.getFullname(), a.getName(), a.getSurname())
|
||||
.filter(s -> s != null && !s.isEmpty())
|
||||
.filter(StringUtils::isNotBlank)
|
||||
.collect(Collectors.joining(""))
|
||||
.toLowerCase()
|
||||
.matches(INVALID_AUTHOR_REGEX);
|
||||
}
|
||||
|
||||
private static boolean urlFilter(String u) {
|
||||
try {
|
||||
final URL url = new URL(u);
|
||||
if (StringUtils.isBlank(url.getPath()) || "/".equals(url.getPath())) {
|
||||
return false;
|
||||
}
|
||||
if (INVALID_URL_HOSTS.contains(url.getHost())) {
|
||||
return false;
|
||||
}
|
||||
return !INVALID_URLS.contains(url.toString());
|
||||
} catch (MalformedURLException ex) {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
private static List<StructuredProperty> processPidCleaning(List<StructuredProperty> pids) {
|
||||
return pids
|
||||
.stream()
|
||||
|
@ -692,4 +897,105 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
return s;
|
||||
}
|
||||
|
||||
public static OafEntity applyCoarVocabularies(OafEntity entity, VocabularyGroup vocs) {
|
||||
|
||||
if (entity instanceof Result) {
|
||||
final Result result = (Result) entity;
|
||||
|
||||
Optional
|
||||
.ofNullable(result.getInstance())
|
||||
.ifPresent(
|
||||
instances -> instances
|
||||
.forEach(
|
||||
instance -> {
|
||||
if (Objects.isNull(instance.getInstanceTypeMapping())) {
|
||||
List<InstanceTypeMapping> mapping = Lists.newArrayList();
|
||||
mapping
|
||||
.add(
|
||||
OafMapperUtils
|
||||
.instanceTypeMapping(
|
||||
instance.getInstancetype().getClassname(),
|
||||
OPENAIRE_COAR_RESOURCE_TYPES_3_1));
|
||||
instance.setInstanceTypeMapping(mapping);
|
||||
}
|
||||
Optional<InstanceTypeMapping> optionalItm = instance
|
||||
.getInstanceTypeMapping()
|
||||
.stream()
|
||||
.filter(GraphCleaningFunctions::originalResourceType)
|
||||
.findFirst();
|
||||
if (optionalItm.isPresent()) {
|
||||
InstanceTypeMapping coarItm = optionalItm.get();
|
||||
Optional
|
||||
.ofNullable(
|
||||
vocs
|
||||
.lookupTermBySynonym(
|
||||
OPENAIRE_COAR_RESOURCE_TYPES_3_1, coarItm.getOriginalType()))
|
||||
.ifPresent(type -> {
|
||||
coarItm.setTypeCode(type.getClassid());
|
||||
coarItm.setTypeLabel(type.getClassname());
|
||||
});
|
||||
final List<InstanceTypeMapping> mappings = Lists.newArrayList();
|
||||
if (vocs.vocabularyExists(OPENAIRE_USER_RESOURCE_TYPES)) {
|
||||
Optional
|
||||
.ofNullable(
|
||||
vocs
|
||||
.lookupTermBySynonym(
|
||||
OPENAIRE_USER_RESOURCE_TYPES, coarItm.getTypeCode()))
|
||||
.ifPresent(
|
||||
type -> mappings
|
||||
.add(
|
||||
OafMapperUtils
|
||||
.instanceTypeMapping(coarItm.getTypeCode(), type)));
|
||||
}
|
||||
if (!mappings.isEmpty()) {
|
||||
instance.getInstanceTypeMapping().addAll(mappings);
|
||||
}
|
||||
}
|
||||
}));
|
||||
result.setMetaResourceType(getMetaResourceType(result.getInstance(), vocs));
|
||||
}
|
||||
|
||||
return entity;
|
||||
}
|
||||
|
||||
private static boolean originalResourceType(InstanceTypeMapping itm) {
|
||||
return StringUtils.isNotBlank(itm.getOriginalType()) &&
|
||||
OPENAIRE_COAR_RESOURCE_TYPES_3_1.equals(itm.getVocabularyName()) &&
|
||||
StringUtils.isBlank(itm.getTypeCode()) &&
|
||||
StringUtils.isBlank(itm.getTypeLabel());
|
||||
}
|
||||
|
||||
private static Qualifier getMetaResourceType(final List<Instance> instances, final VocabularyGroup vocs) {
|
||||
return Optional
|
||||
.ofNullable(instances)
|
||||
.map(ii -> {
|
||||
if (vocs.vocabularyExists(OPENAIRE_META_RESOURCE_TYPE)) {
|
||||
Optional<InstanceTypeMapping> itm = ii
|
||||
.stream()
|
||||
.filter(Objects::nonNull)
|
||||
.flatMap(
|
||||
i -> Optional
|
||||
.ofNullable(i.getInstanceTypeMapping())
|
||||
.map(Collection::stream)
|
||||
.orElse(Stream.empty()))
|
||||
.filter(t -> OPENAIRE_COAR_RESOURCE_TYPES_3_1.equals(t.getVocabularyName()))
|
||||
.findFirst();
|
||||
|
||||
if (!itm.isPresent() || Objects.isNull(itm.get().getTypeCode())) {
|
||||
return null;
|
||||
} else {
|
||||
final String typeCode = itm.get().getTypeCode();
|
||||
return Optional
|
||||
.ofNullable(vocs.lookupTermBySynonym(OPENAIRE_META_RESOURCE_TYPE, typeCode))
|
||||
.orElseThrow(
|
||||
() -> new IllegalStateException("unable to find a synonym for '" + typeCode + "' in " +
|
||||
OPENAIRE_META_RESOURCE_TYPE));
|
||||
}
|
||||
} else {
|
||||
throw new IllegalStateException("vocabulary '" + OPENAIRE_META_RESOURCE_TYPE + "' not available");
|
||||
}
|
||||
})
|
||||
.orElse(null);
|
||||
}
|
||||
|
||||
}
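Note on `urlFilter` (added in the hunk above): the sketch below reproduces its three checks in isolation, using only the JDK. The class name `UrlFilterSketch` is hypothetical. The contents of `INVALID_URL_HOSTS` and `INVALID_URLS` are placeholder assumptions, because the real constants are not visible in this part of the diff.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-alone sketch of the URL filter; not the project's class.
final class UrlFilterSketch {

    // Assumption: placeholder values, the real sets are defined elsewhere in the project.
    static final Set<String> INVALID_URL_HOSTS = new HashSet<>(Arrays.asList("invalid.example"));
    static final Set<String> INVALID_URLS = new HashSet<>(Arrays.asList("http://repo.example/handle/0000"));

    static boolean urlFilter(String u) {
        try {
            final URL url = new URL(u);
            final String path = url.getPath();
            // Reject URLs that point only at a site root (empty path or "/").
            if (path == null || path.trim().isEmpty() || "/".equals(path)) {
                return false;
            }
            // Reject URLs served by a blacklisted host.
            if (INVALID_URL_HOSTS.contains(url.getHost())) {
                return false;
            }
            // Reject URLs that are blacklisted as a whole.
            return !INVALID_URLS.contains(url.toString());
        } catch (MalformedURLException ex) {
            // Anything that does not parse as a URL is dropped.
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> kept = Arrays
            .asList("https://example.org/record/1", "https://example.org/", "not a url")
            .stream()
            .filter(UrlFilterSketch::urlFilter)
            .collect(Collectors.toList());
        System.out.println(kept);
    }
}
```

Under these assumptions only URLs with a non-trivial path survive, so bare landing pages and unparsable strings are removed from an instance's URL list.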
|
||||
|
|
|
@ -6,13 +6,19 @@ import java.util.regex.Pattern;
|
|||
|
||||
public class GridCleaningRule {
|
||||
|
||||
public static final Pattern PATTERN = Pattern.compile("(?<grid>\\d{4,6}\\.[0-9a-z]{1,2})");
|
||||
|
||||
public static String clean(String grid) {
|
||||
String s = grid
|
||||
.replaceAll("\\s", "")
|
||||
.toLowerCase();
|
||||
|
||||
Matcher m = Pattern.compile("\\d{4,6}\\.[0-9a-z]{1,2}").matcher(s);
|
||||
return m.matches() ? "grid." + m.group() : "";
|
||||
Matcher m = PATTERN.matcher(s);
|
||||
if (m.find()) {
|
||||
return "grid." + m.group("grid");
|
||||
}
|
||||
|
||||
return "";
|
||||
}
|
||||
|
||||
}
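Note on the `matches()` to `find()` change in this hunk: the sketch below contrasts the two variants side by side. It assumes only `java.util.regex`; the class name `GridMatchSketch` and the method names `strictClean` and `lenientClean` are hypothetical labels for the old and new behaviour, not names from the project.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch contrasting the full-match variant with the find-based variant.
final class GridMatchSketch {

    static final Pattern PATTERN = Pattern.compile("(?<grid>\\d{4,6}\\.[0-9a-z]{1,2})");

    private static String normalize(String grid) {
        return grid.replaceAll("\\s", "").toLowerCase();
    }

    // Old behaviour: the whole input must be the bare identifier.
    static String strictClean(String grid) {
        Matcher m = PATTERN.matcher(normalize(grid));
        return m.matches() ? "grid." + m.group("grid") : "";
    }

    // New behaviour: the identifier may appear anywhere in the input.
    static String lenientClean(String grid) {
        Matcher m = PATTERN.matcher(normalize(grid));
        return m.find() ? "grid." + m.group("grid") : "";
    }

    public static void main(String[] args) {
        System.out.println("strict:  '" + strictClean("grid.493784.5") + "'");
        System.out.println("lenient: '" + lenientClean("grid.493784.5") + "'");
    }
}
```

With the full-match variant an input that already carries the `grid.` prefix is rejected, while the find-based variant extracts the identifier and re-applies the prefix.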
|
||||
|
|
|
@ -7,10 +7,12 @@ import java.util.regex.Pattern;
|
|||
// https://www.wikidata.org/wiki/Property:P213
|
||||
public class ISNICleaningRule {
|
||||
|
||||
public static final Pattern PATTERN = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])");
|
||||
|
||||
public static String clean(final String isni) {
|
||||
|
||||
Matcher m = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])").matcher(isni);
|
||||
if (m.matches()) {
|
||||
Matcher m = PATTERN.matcher(isni);
|
||||
if (m.find()) {
|
||||
return String.join("", m.group(1), m.group(2), m.group(3), m.group(4));
|
||||
} else {
|
||||
return "";
|
||||
|
|
|
@ -14,7 +14,6 @@ import java.util.stream.Collectors;
|
|||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import eu.dnetlib.dhp.schema.common.AccessRightComparator;
|
||||
import eu.dnetlib.dhp.schema.common.ModelConstants;
|
||||
import eu.dnetlib.dhp.schema.common.ModelSupport;
|
||||
import eu.dnetlib.dhp.schema.oaf.*;
|
||||
|
||||
|
@ -141,6 +140,28 @@ public class OafMapperUtils {
|
|||
.collect(Collectors.toList());
|
||||
}
|
||||
|
||||
public static InstanceTypeMapping instanceTypeMapping(String originalType, String code, String label,
|
||||
String vocabularyName) {
|
||||
final InstanceTypeMapping m = new InstanceTypeMapping();
|
||||
m.setVocabularyName(vocabularyName);
|
||||
m.setOriginalType(originalType);
|
||||
m.setTypeCode(code);
|
||||
m.setTypeLabel(label);
|
||||
return m;
|
||||
}
|
||||
|
||||
public static InstanceTypeMapping instanceTypeMapping(String originalType, Qualifier term) {
|
||||
return instanceTypeMapping(originalType, term.getClassid(), term.getClassname(), term.getSchemeid());
|
||||
}
|
||||
|
||||
public static InstanceTypeMapping instanceTypeMapping(String originalType) {
|
||||
return instanceTypeMapping(originalType, null, null, null);
|
||||
}
|
||||
|
||||
public static InstanceTypeMapping instanceTypeMapping(String originalType, String vocabularyName) {
|
||||
return instanceTypeMapping(originalType, null, null, vocabularyName);
|
||||
}
|
||||
|
||||
public static Qualifier unknown(final String schemeid, final String schemename) {
|
||||
return qualifier(UNKNOWN, "Unknown", schemeid, schemename);
|
||||
}
|
||||
|
|
|
@ -6,10 +6,12 @@ import java.util.regex.Pattern;
|
|||
|
||||
public class PICCleaningRule {
|
||||
|
||||
public static final Pattern PATTERN = Pattern.compile("\\d{9}");
|
||||
|
||||
public static String clean(final String pic) {
|
||||
|
||||
Matcher m = Pattern.compile("\\d{9}").matcher(pic);
|
||||
if (m.matches()) {
|
||||
Matcher m = PATTERN.matcher(pic);
|
||||
if (m.find()) {
|
||||
return m.group();
|
||||
} else {
|
||||
return "";
|
||||
|
|
|
@ -1,13 +1,24 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import java.util.regex.Matcher;
|
||||
import java.util.regex.Pattern;
|
||||
|
||||
public class PmcCleaningRule {
|
||||
|
||||
public static final Pattern PATTERN = Pattern.compile("PMC\\d{1,8}");
|
||||
|
||||
public static String clean(String pmc) {
|
||||
String s = pmc
|
||||
.replaceAll("\\s", "")
|
||||
.toUpperCase();
|
||||
return s.matches("^PMC\\d{1,8}$") ? s : "";
|
||||
|
||||
final Matcher m = PATTERN.matcher(s);
|
||||
|
||||
if (m.find()) {
|
||||
return m.group();
|
||||
}
|
||||
return "";
|
||||
}
|
||||
|
||||
}
|
||||
|
|
|
@ -1,16 +1,25 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import java.util.regex.Matcher;
|
||||
import java.util.regex.Pattern;
|
||||
|
||||
// https://researchguides.stevens.edu/c.php?g=442331&p=6577176
|
||||
public class PmidCleaningRule {
|
||||
|
||||
public static final Pattern PATTERN = Pattern.compile("0*(\\d{1,8})");
|
||||
|
||||
public static String clean(String pmid) {
|
||||
String s = pmid
|
||||
.toLowerCase()
|
||||
.replaceAll("\\s", "")
|
||||
.trim()
|
||||
.replaceAll("^0+", "");
|
||||
return s.matches("^\\d{1,8}$") ? s : "";
|
||||
.replaceAll("\\s", "");
|
||||
|
||||
final Matcher m = PATTERN.matcher(s);
|
||||
|
||||
if (m.find()) {
|
||||
return m.group(1);
|
||||
}
|
||||
return "";
|
||||
}
|
||||
|
||||
}
|
||||
|
|
|
@ -7,12 +7,21 @@ import java.util.regex.Pattern;
|
|||
// https://ror.readme.io/docs/ror-identifier-pattern
|
||||
public class RorCleaningRule {
|
||||
|
||||
public static final String ROR_PREFIX = "https://ror.org/";
|
||||
|
||||
private static final Pattern PATTERN = Pattern.compile("(?<ror>0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2})");
|
||||
|
||||
public static String clean(String ror) {
|
||||
String s = ror
|
||||
.replaceAll("\\s", "")
|
||||
.toLowerCase();
|
||||
Matcher m = Pattern.compile("0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2}").matcher(s);
|
||||
return m.matches() ? "https://ror.org/" + m.group() : "";
|
||||
|
||||
Matcher m = PATTERN.matcher(s);
|
||||
|
||||
if (m.find()) {
|
||||
return ROR_PREFIX + m.group("ror");
|
||||
}
|
||||
return "";
|
||||
}
|
||||
|
||||
}
|
||||
|
|
|
@ -8,13 +8,31 @@
|
|||
{
|
||||
"paramName": "gin",
|
||||
"paramLongName": "graphInputPath",
|
||||
"paramDescription": "the graph root path",
|
||||
"paramDescription": "the input graph root path",
|
||||
"paramRequired": true
|
||||
},
|
||||
{
|
||||
"paramName": "cp",
|
||||
"paramLongName": "checkpointPath",
|
||||
"paramDescription": "checkpoint directory",
|
||||
"paramRequired": true
|
||||
},
|
||||
{
|
||||
"paramName": "out",
|
||||
"paramLongName": "outputPath",
|
||||
"paramDescription": "the output merged graph root path",
|
||||
"paramDescription": "the output graph root path",
|
||||
"paramRequired": true
|
||||
},
|
||||
{
|
||||
"paramName": "fi",
|
||||
"paramLongName": "filterInvisible",
|
||||
"paramDescription": "if true filters out invisible entities",
|
||||
"paramRequired": true
|
||||
},
|
||||
{
|
||||
"paramName": "isu",
|
||||
"paramLongName": "isLookupUrl",
|
||||
"paramDescription": "url to the ISLookup Service",
|
||||
"paramRequired": true
|
||||
}
|
||||
]
|
|
@ -0,0 +1,20 @@
|
|||
[
|
||||
{
|
||||
"paramName": "issm",
|
||||
"paramLongName": "isSparkSessionManaged",
|
||||
"paramDescription": "when true will stop SparkSession after job execution",
|
||||
"paramRequired": false
|
||||
},
|
||||
{
|
||||
"paramName": "hmu",
|
||||
"paramLongName": "hiveMetastoreUris",
|
||||
"paramDescription": "the hive metastore uris",
|
||||
"paramRequired": true
|
||||
},
|
||||
{
|
||||
"paramName": "sql",
|
||||
"paramLongName": "sql",
|
||||
"paramDescription": "sql script to execute",
|
||||
"paramRequired": true
|
||||
}
|
||||
]
|
|
@ -15,7 +15,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;
|
|||
|
||||
public class MdStoreClientTest {
|
||||
|
||||
@Test
|
||||
// @Test
|
||||
public void testMongoCollection() throws IOException {
|
||||
final MdstoreClient client = new MdstoreClient("mongodb://localhost:27017", "mdstore");
|
||||
|
||||
|
|
|
@ -0,0 +1,18 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class GridCleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
assertEquals("grid.493784.5", GridCleaningRule.clean("grid.493784.5"));
|
||||
assertEquals("grid.493784.5x", GridCleaningRule.clean("grid.493784.5x"));
|
||||
assertEquals("grid.493784.5x", GridCleaningRule.clean("493784.5x"));
|
||||
assertEquals("", GridCleaningRule.clean("493x784.5x"));
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,19 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class ISNICleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
assertEquals("0000000463436020", ISNICleaningRule.clean("0000 0004 6343 6020"));
|
||||
assertEquals("0000000463436020", ISNICleaningRule.clean("0000000463436020"));
|
||||
assertEquals("", ISNICleaningRule.clean("Q30256598"));
|
||||
assertEquals("0000000493403529", ISNICleaningRule.clean("ISNI:0000000493403529"));
|
||||
assertEquals("000000008614884X", ISNICleaningRule.clean("0000 0000 8614 884X"));
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,19 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class PICCleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
assertEquals("887624982", PICCleaningRule.clean("887624982"));
|
||||
assertEquals("", PICCleaningRule.clean("887 624982"));
|
||||
assertEquals("887624982", PICCleaningRule.clean(" 887624982 "));
|
||||
assertEquals("887624982", PICCleaningRule.clean(" 887624982x "));
|
||||
assertEquals("887624982", PICCleaningRule.clean(" 88762498200 "));
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,19 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class PmcCleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
assertEquals("PMC1234", PmcCleaningRule.clean("PMC1234"));
|
||||
assertEquals("PMC1234", PmcCleaningRule.clean(" PMC1234"));
|
||||
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC12345678"));
|
||||
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC123456789"));
|
||||
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC 12345678"));
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,24 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class PmidCleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
// leading zeros are removed
|
||||
assertEquals("1234", PmidCleaningRule.clean("01234"));
|
||||
// tolerant to spaces in the middle
|
||||
assertEquals("1234567", PmidCleaningRule.clean("0123 4567"));
|
||||
// stop parsing at the first non-numeric character
|
||||
assertEquals("123", PmidCleaningRule.clean("0123x4567"));
|
||||
// invalid id leading to empty result
|
||||
assertEquals("", PmidCleaningRule.clean("abc"));
|
||||
// valid id with zeroes in the number
|
||||
assertEquals("20794075", PmidCleaningRule.clean("20794075"));
|
||||
}
|
||||
|
||||
}
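Note on the behaviour these assertions pin down: because the rule now uses `find()` with a capture group, an over-long or prefixed input is trimmed to the first run of up to eight digits instead of being rejected. A minimal sketch, assuming the same pattern as `PmidCleaningRule`; the class name `PmidFindSketch` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the find-based PMID cleaning.
final class PmidFindSketch {

    // Leading zeros are consumed outside the group, so group(1) never starts with a padding zero
    // unless the identifier consists of zeros only.
    static final Pattern PATTERN = Pattern.compile("0*(\\d{1,8})");

    static String clean(String pmid) {
        String s = pmid.toLowerCase().replaceAll("\\s", "");
        Matcher m = PATTERN.matcher(s);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(clean("01234"));      // leading zero dropped
        System.out.println(clean("pmid:0042"));  // non-numeric prefix skipped
        System.out.println(clean("123456789"));  // ninth digit cut off
    }
}
```

The truncation of a nine-digit input is a side effect worth keeping in mind: a malformed identifier can be turned into a different, syntactically valid one.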
|
|
@ -0,0 +1,17 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class RorCleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("https://ror.org/05rpz9w55"));
|
||||
assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("05rpz9w55"));
|
||||
assertEquals("", RorCleaningRule.clean("05rpz9w_55"));
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,114 @@
|
|||
|
||||
package eu.dnetlib.oa.merge;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.*;
|
||||
|
||||
import java.io.BufferedReader;
|
||||
import java.io.InputStreamReader;
|
||||
import java.util.List;
|
||||
import java.util.Objects;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
import org.junit.platform.commons.util.StringUtils;
|
||||
|
||||
import com.fasterxml.jackson.core.type.TypeReference;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
|
||||
import eu.dnetlib.dhp.oa.merge.AuthorMerger;
|
||||
import eu.dnetlib.dhp.schema.oaf.Author;
|
||||
|
||||
public class AuthorMergerTest {
|
||||
|
||||
@Test
|
||||
public void testEnrichAuthor() throws Exception {
|
||||
final ObjectMapper mapper = new ObjectMapper();
|
||||
|
||||
BufferedReader pr = new BufferedReader(new InputStreamReader(
|
||||
Objects
|
||||
.requireNonNull(
|
||||
AuthorMergerTest.class
|
||||
.getResourceAsStream("/eu/dnetlib/dhp/oa/merge/authors_publication_sample.json"))));
|
||||
BufferedReader or = new BufferedReader(new InputStreamReader(
|
||||
Objects
|
||||
.requireNonNull(
|
||||
AuthorMergerTest.class.getResourceAsStream("/eu/dnetlib/dhp/oa/merge/authors_orcid_sample.json"))));
|
||||
|
||||
TypeReference<List<Author>> aclass = new TypeReference<List<Author>>() {
|
||||
};
|
||||
String pubLine;
|
||||
|
||||
int i = 0;
|
||||
while ((pubLine = pr.readLine()) != null) {
|
||||
final String pubId = pubLine;
|
||||
final String MatchPidOrcid = or.readLine();
|
||||
final String pubOrcid = or.readLine();
|
||||
|
||||
final String data = pr.readLine();
|
||||
|
||||
if (StringUtils.isNotBlank(data)) {
|
||||
List<Author> publicationAuthors = mapper.readValue(data, aclass);
|
||||
List<Author> orcidAuthors = mapper.readValue(or.readLine(), aclass);
|
||||
System.out.printf("OAF ID = %s \n", pubId);
|
||||
System.out.printf("ORCID Intersected ID = %s \n", pubOrcid);
|
||||
System.out.printf("OAF Author Size = %d \n", publicationAuthors.size());
|
||||
System.out.printf("ORCID Author Size = %d \n", orcidAuthors.size());
|
||||
System.out.printf("ORCID Matched PID = %s \n", MatchPidOrcid);
|
||||
|
||||
long originalAuthorWithPiD = publicationAuthors
|
||||
.stream()
|
||||
.filter(
|
||||
a -> a.getPid() != null && a
|
||||
.getPid()
|
||||
.stream()
|
||||
.anyMatch(
|
||||
p -> p.getQualifier() != null
|
||||
&& p.getQualifier().getClassid().toLowerCase().contains("orcid")))
|
||||
.count();
|
||||
long start = System.currentTimeMillis();
|
||||
|
||||
// final List<Author> enrichedList = AuthorMerger.enrichOrcid(publicationAuthors, orcidAuthors);
|
||||
final List<Author> enrichedList = AuthorMerger.enrichOrcid(publicationAuthors, orcidAuthors);
|
||||
|
||||
long enrichedAuthorWithPid = enrichedList
|
||||
.stream()
|
||||
.filter(
|
||||
a -> a.getPid() != null && a
|
||||
.getPid()
|
||||
.stream()
|
||||
.anyMatch(
|
||||
p -> p.getQualifier() != null
|
||||
&& p.getQualifier().getClassid().toLowerCase().contains("orcid")))
|
||||
.count();
|
||||
|
||||
long totalTime = (System.currentTimeMillis() - start) / 1000;
|
||||
System.out
|
||||
.printf(
|
||||
"Enriched authors in %d seconds from %d pid to %d pid \n", totalTime, originalAuthorWithPiD,
|
||||
enrichedAuthorWithPid);
|
||||
|
||||
System.out.println("=================");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@Test
|
||||
public void checkSimilarityTest() {
|
||||
final Author left = new Author();
|
||||
left.setName("Anand");
|
||||
left.setSurname("Rachna");
|
||||
left.setFullname("Anand, Rachna");
|
||||
|
||||
System.out.println(AuthorMerger.normalizeFullName(left.getFullname()));
|
||||
|
||||
final Author right = new Author();
|
||||
right.setName("Rachna");
|
||||
right.setSurname("Anand");
|
||||
right.setFullname("Rachna, Anand");
|
||||
// System.out.println(AuthorMerger.normalize(right.getFullname()));
|
||||
boolean same = AuthorMerger.checkORCIDSimilarity(left, right);
|
||||
|
||||
assertTrue(same);
|
||||
|
||||
}
|
||||
|
||||
}
|
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
|
@ -0,0 +1,110 @@
|
|||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
|
||||
|
||||
<modelVersion>4.0.0</modelVersion>
|
||||
|
||||
<parent>
|
||||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
<artifactId>dhp</artifactId>
|
||||
<version>1.2.5-SNAPSHOT</version>
|
||||
<relativePath>../pom.xml</relativePath>
|
||||
</parent>
|
||||
|
||||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
<artifactId>dhp-pace-core</artifactId>
|
||||
<version>1.2.5-SNAPSHOT</version>
|
||||
<packaging>jar</packaging>
|
||||
|
||||
<build>
|
||||
<plugins>
|
||||
<plugin>
|
||||
<groupId>net.alchim31.maven</groupId>
|
||||
<artifactId>scala-maven-plugin</artifactId>
|
||||
<version>${net.alchim31.maven.version}</version>
|
||||
<executions>
|
||||
<execution>
|
||||
<id>scala-compile-first</id>
|
||||
<phase>initialize</phase>
|
||||
<goals>
|
||||
<goal>add-source</goal>
|
||||
<goal>compile</goal>
|
||||
</goals>
|
||||
</execution>
|
||||
<execution>
|
||||
<id>scala-test-compile</id>
|
||||
<phase>process-test-resources</phase>
|
||||
<goals>
|
||||
<goal>testCompile</goal>
|
||||
</goals>
|
||||
</execution>
|
||||
</executions>
|
||||
<configuration>
|
||||
<failOnMultipleScalaVersions>true</failOnMultipleScalaVersions>
|
||||
<scalaCompatVersion>${scala.binary.version}</scalaCompatVersion>
|
||||
<scalaVersion>${scala.version}</scalaVersion>
|
||||
</configuration>
|
||||
</plugin>
|
||||
</plugins>
|
||||
|
||||
</build>
|
||||
|
||||
<dependencies>
|
||||
<dependency>
|
||||
<groupId>edu.cmu</groupId>
|
||||
<artifactId>secondstring</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>com.google.guava</groupId>
|
||||
<artifactId>guava</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>com.google.code.gson</groupId>
|
||||
<artifactId>gson</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>org.apache.commons</groupId>
|
||||
<artifactId>commons-lang3</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>commons-io</groupId>
|
||||
<artifactId>commons-io</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>org.antlr</groupId>
|
||||
<artifactId>stringtemplate</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>commons-logging</groupId>
|
||||
<artifactId>commons-logging</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>org.reflections</groupId>
|
||||
<artifactId>reflections</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>com.fasterxml.jackson.core</groupId>
|
||||
<artifactId>jackson-databind</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>org.apache.commons</groupId>
|
||||
<artifactId>commons-math3</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>com.jayway.jsonpath</groupId>
|
||||
<artifactId>json-path</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>com.ibm.icu</groupId>
|
||||
<artifactId>icu4j</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>org.apache.spark</groupId>
|
||||
<artifactId>spark-core_${scala.binary.version}</artifactId>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>org.apache.spark</groupId>
|
||||
<artifactId>spark-sql_${scala.binary.version}</artifactId>
|
||||
</dependency>
|
||||
</dependencies>
|
||||
|
||||
</project>
|
|
@ -0,0 +1,59 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.HashSet;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import eu.dnetlib.pace.common.AbstractPaceFunctions;
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
public abstract class AbstractClusteringFunction extends AbstractPaceFunctions implements ClusteringFunction {
|
||||
|
||||
protected Map<String, Object> params;
|
||||
|
||||
public AbstractClusteringFunction(final Map<String, Object> params) {
|
||||
this.params = params;
|
||||
}
|
||||
|
||||
protected abstract Collection<String> doApply(Config conf, String s);
|
||||
|
||||
@Override
|
||||
public Collection<String> apply(Config conf, List<String> fields) {
|
||||
return fields
|
||||
.stream()
|
||||
.filter(f -> !f.isEmpty())
|
||||
.map(s -> normalize(s))
|
||||
.map(s -> filterAllStopWords(s))
|
||||
.map(s -> doApply(conf, s))
|
||||
.map(c -> filterBlacklisted(c, ngramBlacklist))
|
||||
.flatMap(c -> c.stream())
|
||||
.filter(StringUtils::isNotBlank)
|
||||
.collect(Collectors.toCollection(HashSet::new));
|
||||
}
|
||||
|
||||
public Map<String, Object> getParams() {
|
||||
return params;
|
||||
}
|
||||
|
||||
protected Integer param(String name) {
|
||||
Object val = params.get(name);
|
||||
if (val == null)
|
||||
return null;
|
||||
if (val instanceof Number) {
|
||||
return ((Number) val).intValue();
|
||||
}
|
||||
return Integer.parseInt(val.toString());
|
||||
}
|
||||
|
||||
protected int paramOrDefault(String name, int i) {
|
||||
Integer res = param(name);
|
||||
if (res == null)
|
||||
res = i;
|
||||
return res;
|
||||
}
|
||||
}
|
|
@ -0,0 +1,51 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
import java.util.StringTokenizer;
|
||||
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("acronyms")
|
||||
public class Acronyms extends AbstractClusteringFunction {
|
||||
|
||||
public Acronyms(Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(Config conf, String s) {
|
||||
return extractAcronyms(s, param("max"), param("minLen"), param("maxLen"));
|
||||
}
|
||||
|
||||
private Set<String> extractAcronyms(final String s, int maxAcronyms, int minLen, int maxLen) {
|
||||
|
||||
final Set<String> acronyms = Sets.newLinkedHashSet();
|
||||
|
||||
for (int i = 0; i < maxAcronyms; i++) {
|
||||
|
||||
final StringTokenizer st = new StringTokenizer(s);
|
||||
final StringBuilder sb = new StringBuilder();
|
||||
|
||||
while (st.hasMoreTokens()) {
|
||||
final String token = st.nextToken();
|
||||
if (sb.length() > maxLen) {
|
||||
break;
|
||||
}
|
||||
if (token.length() > 1 && i < token.length()) {
|
||||
sb.append(token.charAt(i));
|
||||
}
|
||||
}
|
||||
String acronym = sb.toString();
|
||||
if (acronym.length() > minLen) {
|
||||
acronyms.add(acronym);
|
||||
}
|
||||
}
|
||||
return acronyms;
|
||||
}
|
||||
|
||||
}
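Note on `extractAcronyms`: the i-th candidate acronym is built from the i-th character of every token longer than one character. The sketch below restates that loop with JDK collections only (a `LinkedHashSet` in place of Guava's `Sets.newLinkedHashSet()`); the class name `AcronymsSketch` and the parameter values in `main` are illustrative assumptions.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical stand-alone restatement of the acronym extraction loop.
final class AcronymsSketch {

    static Set<String> extractAcronyms(String s, int maxAcronyms, int minLen, int maxLen) {
        final Set<String> acronyms = new LinkedHashSet<>();
        for (int i = 0; i < maxAcronyms; i++) {
            final StringTokenizer st = new StringTokenizer(s);
            final StringBuilder sb = new StringBuilder();
            while (st.hasMoreTokens()) {
                final String token = st.nextToken();
                // The length check runs before the append, so a candidate can reach maxLen + 1 characters.
                if (sb.length() > maxLen) {
                    break;
                }
                // Single-character tokens and tokens shorter than i + 1 contribute nothing.
                if (token.length() > 1 && i < token.length()) {
                    sb.append(token.charAt(i));
                }
            }
            final String acronym = sb.toString();
            // Strict comparison: a candidate of exactly minLen characters is discarded.
            if (acronym.length() > minLen) {
                acronyms.add(acronym);
            }
        }
        return acronyms;
    }

    public static void main(String[] args) {
        System.out.println(extractAcronyms("national research council", 2, 1, 4));
    }
}
```

Note that in the class above `param(...)` returns an `Integer` that may be `null`, so a configuration missing `max`, `minLen` or `maxLen` would fail on unboxing; the sketch sidesteps this by taking plain `int` arguments.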
|
|
@ -0,0 +1,14 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.lang.annotation.ElementType;
|
||||
import java.lang.annotation.Retention;
|
||||
import java.lang.annotation.RetentionPolicy;
|
||||
import java.lang.annotation.Target;
|
||||
|
||||
@Retention(RetentionPolicy.RUNTIME)
|
||||
@Target(ElementType.TYPE)
|
||||
public @interface ClusteringClass {
|
||||
|
||||
public String value();
|
||||
}
|
|
@ -0,0 +1,16 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
public interface ClusteringFunction {
|
||||
|
||||
public Collection<String> apply(Config config, List<String> fields);
|
||||
|
||||
public Map<String, Object> getParams();
|
||||
|
||||
}
|
|
@ -0,0 +1,28 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import com.google.common.collect.Lists;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("immutablefieldvalue")
|
||||
public class ImmutableFieldValue extends AbstractClusteringFunction {
|
||||
|
||||
public ImmutableFieldValue(final Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(final Config conf, final String s) {
|
||||
final List<String> res = Lists.newArrayList();
|
||||
|
||||
res.add(s);
|
||||
|
||||
return res;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,69 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.HashSet;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import com.jayway.jsonpath.Configuration;
|
||||
import com.jayway.jsonpath.DocumentContext;
|
||||
import com.jayway.jsonpath.JsonPath;
|
||||
import com.jayway.jsonpath.Option;
|
||||
|
||||
import eu.dnetlib.pace.common.AbstractPaceFunctions;
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.util.MapDocumentUtil;
|
||||
|
||||
@ClusteringClass("jsonlistclustering")
|
||||
public class JSONListClustering extends AbstractPaceFunctions implements ClusteringFunction {
|
||||
|
||||
private Map<String, Object> params;
|
||||
|
||||
public JSONListClustering(Map<String, Object> params) {
|
||||
this.params = params;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> getParams() {
|
||||
return params;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Collection<String> apply(Config conf, List<String> fields) {
|
||||
return fields
|
||||
.stream()
|
||||
.filter(f -> !f.isEmpty())
|
||||
.map(s -> doApply(conf, s))
|
||||
.filter(StringUtils::isNotBlank)
|
||||
.collect(Collectors.toCollection(HashSet::new));
|
||||
}
|
||||
|
||||
private String doApply(Config conf, String json) {
|
||||
StringBuilder st = new StringBuilder(); // builds the comparison string from the jpath entries in the
|
||||
// parameters
|
||||
final DocumentContext documentContext = JsonPath
|
||||
.using(Configuration.defaultConfiguration().addOptions(Option.SUPPRESS_EXCEPTIONS))
|
||||
.parse(json);
|
||||
|
||||
// for each path in the param list
|
||||
for (String key : params.keySet().stream().filter(k -> k.contains("jpath")).collect(Collectors.toList())) {
|
||||
String path = params.get(key).toString();
|
||||
String value = MapDocumentUtil.getJPathString(path, documentContext);
|
||||
if (value == null || value.isEmpty())
|
||||
value = "";
|
||||
st.append(value);
|
||||
st.append(" ");
|
||||
}
|
||||
|
||||
if (st.length() > 0) st.setLength(st.length() - 1); // drop the trailing space; skip when no jpath key matched
|
||||
|
||||
if (StringUtils.isBlank(st)) {
|
||||
return "1";
|
||||
}
|
||||
return st.toString();
|
||||
}
|
||||
}
|
|
@ -0,0 +1,54 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.*;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("keywordsclustering")
|
||||
public class KeywordsClustering extends AbstractClusteringFunction {
|
||||
|
||||
public KeywordsClustering(Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(final Config conf, String s) {
|
||||
|
||||
// takes city codes and keyword codes without duplicates
|
||||
Set<String> keywords = getKeywords(s, conf.translationMap(), paramOrDefault("windowSize", 4));
|
||||
Set<String> cities = getCities(s, paramOrDefault("windowSize", 4));
|
||||
|
||||
// list of combinations to return as result
|
||||
final Collection<String> combinations = new LinkedHashSet<String>();
|
||||
|
||||
for (String keyword : keywordsToCodes(keywords, conf.translationMap())) {
|
||||
for (String city : citiesToCodes(cities)) {
|
||||
combinations.add(keyword + "-" + city);
|
||||
if (combinations.size() >= paramOrDefault("max", 2)) {
|
||||
return combinations;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return combinations;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Collection<String> apply(final Config conf, List<String> fields) {
|
||||
return fields
|
||||
.stream()
|
||||
.filter(f -> !f.isEmpty())
|
||||
.map(KeywordsClustering::cleanup)
|
||||
.map(KeywordsClustering::normalize)
|
||||
.map(s -> filterAllStopWords(s))
|
||||
.map(s -> doApply(conf, s))
|
||||
.map(c -> filterBlacklisted(c, ngramBlacklist))
|
||||
.flatMap(c -> c.stream())
|
||||
.filter(StringUtils::isNotBlank)
|
||||
.collect(Collectors.toCollection(HashSet::new));
|
||||
}
|
||||
}
|
|
@ -0,0 +1,78 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.*;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import com.google.common.collect.Lists;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.model.Person;
|
||||
|
||||
@ClusteringClass("lnfi")
|
||||
public class LastNameFirstInitial extends AbstractClusteringFunction {
|
||||
|
||||
private static final boolean DEFAULT_AGGRESSIVE = true;
|
||||
|
||||
public LastNameFirstInitial(final Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
public Collection<String> apply(Config conf, List<String> fields) {
|
||||
return fields
|
||||
.stream()
|
||||
.filter(f -> !f.isEmpty())
|
||||
.map(LastNameFirstInitial::normalize)
|
||||
.map(s -> doApply(conf, s))
|
||||
.map(c -> filterBlacklisted(c, ngramBlacklist))
|
||||
.flatMap(c -> c.stream())
|
||||
.filter(StringUtils::isNotBlank)
|
||||
.collect(Collectors.toCollection(HashSet::new));
|
||||
}
|
||||
|
||||
public static String normalize(final String s) {
|
||||
return fixAliases(transliterate(nfd(unicodeNormalization(s))))
|
||||
// do not compact the regexes in a single expression, would cause StackOverflowError in case of large input
|
||||
// strings
|
||||
.replaceAll("[^ \\w]+", "")
|
||||
.replaceAll("(\\p{InCombiningDiacriticalMarks})+", "")
|
||||
.replaceAll("(\\p{Punct})+", " ")
|
||||
.replaceAll("(\\d)+", " ")
|
||||
.replaceAll("(\\n)+", " ")
|
||||
.trim();
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(final Config conf, final String s) {
|
||||
|
||||
final List<String> res = Lists.newArrayList();
|
||||
|
||||
final boolean aggressive = (Boolean) (getParams().containsKey("aggressive") ? getParams().get("aggressive")
|
||||
: DEFAULT_AGGRESSIVE);
|
||||
|
||||
Person p = new Person(s, aggressive);
|
||||
|
||||
if (p.isAccurate()) {
|
||||
String lastName = p.getNormalisedSurname().toLowerCase();
|
||||
String firstInitial = p.getNormalisedFirstName().toLowerCase().substring(0, 1);
|
||||
|
||||
res.add(firstInitial.concat(lastName));
|
||||
} else { // is not accurate, meaning it has no defined name and surname
|
||||
List<String> fullname = Arrays.asList(p.getNormalisedFullname().split(" "));
|
||||
if (fullname.size() == 1) {
|
||||
res.add(p.getNormalisedFullname().toLowerCase());
|
||||
} else if (fullname.size() == 2) {
|
||||
res.add(fullname.get(0).substring(0, 1).concat(fullname.get(1)).toLowerCase());
|
||||
res.add(fullname.get(1).substring(0, 1).concat(fullname.get(0)).toLowerCase());
|
||||
} else {
|
||||
res.add(fullname.get(0).substring(0, 1).concat(fullname.get(fullname.size() - 1)).toLowerCase());
|
||||
res.add(fullname.get(fullname.size() - 1).substring(0, 1).concat(fullname.get(0)).toLowerCase());
|
||||
}
|
||||
}
|
||||
|
||||
return res;
|
||||
}
|
||||
}
|
|
@ -0,0 +1,38 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import com.google.common.collect.Lists;
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("lowercase")
|
||||
public class LowercaseClustering extends AbstractClusteringFunction {
|
||||
|
||||
public LowercaseClustering(final Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
public Collection<String> apply(Config conf, List<String> fields) {
|
||||
Collection<String> c = Sets.newLinkedHashSet();
|
||||
for (String f : fields) {
|
||||
c.addAll(doApply(conf, f));
|
||||
}
|
||||
return c;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(final Config conf, final String s) {
|
||||
if (StringUtils.isBlank(s)) {
|
||||
return Lists.newArrayList();
|
||||
}
|
||||
return Lists.newArrayList(s.toLowerCase().trim());
|
||||
}
|
||||
}
|
|
@ -0,0 +1,24 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Set;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import eu.dnetlib.pace.common.AbstractPaceFunctions;
|
||||
|
||||
public class NGramUtils extends AbstractPaceFunctions {
|
||||
private static final NGramUtils NGRAMUTILS = new NGramUtils();
|
||||
|
||||
private static final int SIZE = 100;
|
||||
|
||||
private static final Set<String> stopwords = AbstractPaceFunctions
|
||||
.loadFromClasspath("/eu/dnetlib/pace/config/stopwords_en.txt");
|
||||
|
||||
public static String cleanupForOrdering(String s) {
|
||||
return (NGRAMUTILS.filterStopWords(NGRAMUTILS.normalize(s), stopwords) + StringUtils.repeat(" ", SIZE))
|
||||
.substring(0, SIZE)
|
||||
.replaceAll(" ", "");
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,41 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import com.google.common.collect.Lists;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("ngrampairs")
|
||||
public class NgramPairs extends Ngrams {
|
||||
|
||||
public NgramPairs(Map<String, Object> params) {
|
||||
super(params, false);
|
||||
}
|
||||
|
||||
public NgramPairs(Map<String, Object> params, boolean sorted) {
|
||||
super(params, sorted);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(Config conf, String s) {
|
||||
return ngramPairs(Lists.newArrayList(getNgrams(s, param("ngramLen"), param("max") * 2, 1, 2)), param("max"));
|
||||
}
|
||||
|
||||
protected Collection<String> ngramPairs(final List<String> ngrams, int maxNgrams) {
|
||||
Collection<String> res = Lists.newArrayList();
|
||||
int j = 0;
|
||||
for (int i = 0; i < ngrams.size() && res.size() < maxNgrams; i++) {
|
||||
if (++j >= ngrams.size()) {
|
||||
break;
|
||||
}
|
||||
res.add(ngrams.get(i) + ngrams.get(j));
|
||||
// System.out.println("-- " + concatNgrams);
|
||||
}
|
||||
return res;
|
||||
}
|
||||
|
||||
}
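
The pairing loop in `ngramPairs` concatenates each n-gram with its immediate successor until `maxNgrams` pairs exist. A minimal standalone sketch of that step, assuming a hypothetical class name `NgramPairsSketch` and no dependency on the pace framework or Guava:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone sketch, not part of the commit.
final class NgramPairsSketch {

    // Joins ngrams[i] with ngrams[i + 1], stopping at maxPairs results.
    static List<String> pairs(List<String> ngrams, int maxPairs) {
        final List<String> res = new ArrayList<>();
        for (int i = 0; i + 1 < ngrams.size() && res.size() < maxPairs; i++) {
            res.add(ngrams.get(i) + ngrams.get(i + 1));
        }
        return res;
    }

    public static void main(String[] args) {
        // prints [seaeng, engopt]
        System.out.println(pairs(Arrays.asList("sea", "eng", "opt"), 2));
    }
}
```

A single n-gram yields no pairs, which matches the early `break` in the method above.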
|
|
@ -0,0 +1,52 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("ngrams")
|
||||
public class Ngrams extends AbstractClusteringFunction {
|
||||
|
||||
private final boolean sorted;
|
||||
|
||||
public Ngrams(Map<String, Object> params) {
|
||||
this(params, false);
|
||||
}
|
||||
|
||||
public Ngrams(Map<String, Object> params, boolean sorted) {
|
||||
super(params);
|
||||
this.sorted = sorted;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(Config conf, String s) {
|
||||
return getNgrams(s, param("ngramLen"), param("max"), param("maxPerToken"), param("minNgramLen"));
|
||||
}
|
||||
|
||||
protected Collection<String> getNgrams(String s, int ngramLen, int max, int maxPerToken, int minNgramLen) {
|
||||
|
||||
final Collection<String> ngrams = sorted ? new TreeSet<>() : new LinkedHashSet<String>();
|
||||
final StringTokenizer st = new StringTokenizer(s);
|
||||
|
||||
while (st.hasMoreTokens()) {
|
||||
final String token = st.nextToken();
|
||||
if (!token.isEmpty()) {
|
||||
for (int i = 0; i < maxPerToken && ngramLen + i <= token.length(); i++) {
|
||||
String ngram = token.substring(i, Math.min(ngramLen + i, token.length())).trim();
|
||||
|
||||
if (ngram.length() >= minNgramLen) {
|
||||
ngrams.add(ngram);
|
||||
|
||||
if (ngrams.size() >= max) {
|
||||
return ngrams;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
// System.out.println(ngrams + " n: " + ngrams.size());
|
||||
return ngrams;
|
||||
}
|
||||
|
||||
}
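
The windowing in `getNgrams` can be tried in isolation. The sketch below copies the loop into a hypothetical class `NgramSketch` (an assumption for illustration, unsorted variant only) so the effect of `ngramLen`, `max`, `maxPerToken` and `minNgramLen` can be observed without the rest of the framework:

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.StringTokenizer;

// Hypothetical standalone sketch, not part of the commit.
final class NgramSketch {

    static Collection<String> ngrams(String s, int ngramLen, int max, int maxPerToken, int minNgramLen) {
        final Collection<String> out = new LinkedHashSet<>();
        final StringTokenizer st = new StringTokenizer(s);
        while (st.hasMoreTokens()) {
            final String token = st.nextToken();
            // slide a window of ngramLen characters, at most maxPerToken times per token
            for (int i = 0; i < maxPerToken && ngramLen + i <= token.length(); i++) {
                final String ngram = token.substring(i, ngramLen + i);
                if (ngram.length() >= minNgramLen) {
                    out.add(ngram);
                    if (out.size() >= max) {
                        return out;
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [sea, ear, eng, ngi]
        System.out.println(ngrams("search engine", 3, 10, 2, 3));
    }
}
```

Tokens shorter than `ngramLen` contribute nothing, because the loop condition already requires a full window.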
|
|
@ -0,0 +1,113 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.*;
|
||||
import java.util.stream.Collectors;
|
||||
import java.util.stream.StreamSupport;
|
||||
|
||||
import com.google.common.base.Splitter;
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("numAuthorsTitleSuffixPrefixChain")
|
||||
public class NumAuthorsTitleSuffixPrefixChain extends AbstractClusteringFunction {
|
||||
|
||||
public NumAuthorsTitleSuffixPrefixChain(Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
public Collection<String> apply(Config conf, List<String> fields) {
|
||||
|
||||
try {
|
||||
int num_authors = Math.min(Integer.parseInt(fields.get(0)), 21); // SIZE threshold is 20, +1
|
||||
|
||||
if (num_authors > 0) {
|
||||
return super.apply(conf, fields.subList(1, fields.size()))
|
||||
.stream()
|
||||
.map(s -> num_authors + "-" + s)
|
||||
.collect(Collectors.toList());
|
||||
}
|
||||
} catch (NumberFormatException e) {
|
||||
// missing or null authors array
|
||||
}
|
||||
|
||||
return Collections.emptyList();
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(Config conf, String s) {
|
||||
return suffixPrefixChain(cleanup(s), param("mod"));
|
||||
}
|
||||
|
||||
private Collection<String> suffixPrefixChain(String s, int mod) {
|
||||
// create the list of words from the string (remove short words)
|
||||
List<String> wordsList = Arrays
|
||||
.stream(s.split(" "))
|
||||
.filter(si -> si.length() > 3)
|
||||
.collect(Collectors.toList());
|
||||
|
||||
final int words = wordsList.size();
|
||||
final int letters = s.length();
|
||||
|
||||
// create the prefix: number of words / mod
|
||||
String prefix = words / mod + "-";
|
||||
|
||||
return doSuffixPrefixChain(wordsList, prefix);
|
||||
|
||||
}
|
||||
|
||||
private Collection<String> doSuffixPrefixChain(List<String> wordsList, String prefix) {
|
||||
|
||||
Set<String> set = Sets.newLinkedHashSet();
|
||||
switch (wordsList.size()) {
|
||||
case 0:
|
||||
break;
|
||||
case 1:
|
||||
set.add(wordsList.get(0));
|
||||
break;
|
||||
case 2:
|
||||
set
|
||||
.add(
|
||||
prefix +
|
||||
suffix(wordsList.get(0), 3) +
|
||||
prefix(wordsList.get(1), 3));
|
||||
|
||||
set
|
||||
.add(
|
||||
prefix +
|
||||
prefix(wordsList.get(0), 3) +
|
||||
suffix(wordsList.get(1), 3));
|
||||
|
||||
break;
|
||||
default:
|
||||
set
|
||||
.add(
|
||||
prefix +
|
||||
suffix(wordsList.get(0), 3) +
|
||||
prefix(wordsList.get(1), 3) +
|
||||
suffix(wordsList.get(2), 3));
|
||||
|
||||
set
|
||||
.add(
|
||||
prefix +
|
||||
prefix(wordsList.get(0), 3) +
|
||||
suffix(wordsList.get(1), 3) +
|
||||
prefix(wordsList.get(2), 3));
|
||||
break;
|
||||
}
|
||||
|
||||
return set;
|
||||
|
||||
}
|
||||
|
||||
private String suffix(String s, int len) {
|
||||
return s.substring(s.length() - len);
|
||||
}
|
||||
|
||||
private String prefix(String s, int len) {
|
||||
return s.substring(0, len);
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,84 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
import eu.dnetlib.pace.common.AbstractPaceFunctions;
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.model.Person;
|
||||
|
||||
@ClusteringClass("personClustering")
|
||||
public class PersonClustering extends AbstractPaceFunctions implements ClusteringFunction {
|
||||
|
||||
private Map<String, Object> params;
|
||||
|
||||
private static final int MAX_TOKENS = 5;
|
||||
|
||||
public PersonClustering(final Map<String, Object> params) {
|
||||
this.params = params;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Collection<String> apply(final Config conf, final List<String> fields) {
|
||||
final Set<String> hashes = Sets.newHashSet();
|
||||
|
||||
for (final String f : fields) {
|
||||
|
||||
final Person person = new Person(f, false);
|
||||
|
||||
if (StringUtils.isNotBlank(person.getNormalisedFirstName())
|
||||
&& StringUtils.isNotBlank(person.getNormalisedSurname())) {
|
||||
hashes.add(firstLC(person.getNormalisedFirstName()) + person.getNormalisedSurname().toLowerCase());
|
||||
} else {
|
||||
for (final String token1 : tokens(f, MAX_TOKENS)) {
|
||||
for (final String token2 : tokens(f, MAX_TOKENS)) {
|
||||
if (!token1.equals(token2)) {
|
||||
hashes.add(firstLC(token1) + token2);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return hashes;
|
||||
}
|
||||
|
||||
// @Override
|
||||
// public Collection<String> apply(final List<Field> fields) {
|
||||
// final Set<String> hashes = Sets.newHashSet();
|
||||
//
|
||||
// for (final Field f : fields) {
|
||||
//
|
||||
// final GTAuthor gta = GTAuthor.fromOafJson(f.stringValue());
|
||||
//
|
||||
// final Author a = gta.getAuthor();
|
||||
//
|
||||
// if (StringUtils.isNotBlank(a.getFirstname()) && StringUtils.isNotBlank(a.getSecondnames())) {
|
||||
// hashes.add(firstLC(a.getFirstname()) + a.getSecondnames().toLowerCase());
|
||||
// } else {
|
||||
// for (final String token1 : tokens(f.stringValue(), MAX_TOKENS)) {
|
||||
// for (final String token2 : tokens(f.stringValue(), MAX_TOKENS)) {
|
||||
// if (!token1.equals(token2)) {
|
||||
// hashes.add(firstLC(token1) + token2);
|
||||
// }
|
||||
// }
|
||||
// }
|
||||
// }
|
||||
// }
|
||||
//
|
||||
// return hashes;
|
||||
// }
|
||||
|
||||
@Override
|
||||
public Map<String, Object> getParams() {
|
||||
return params;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,34 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import com.google.common.collect.Lists;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.model.Person;
|
||||
|
||||
@ClusteringClass("personHash")
|
||||
public class PersonHash extends AbstractClusteringFunction {
|
||||
|
||||
private static final boolean DEFAULT_AGGRESSIVE = false;
|
||||
|
||||
public PersonHash(final Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(final Config conf, final String s) {
|
||||
final List<String> res = Lists.newArrayList();
|
||||
|
||||
final boolean aggressive = (Boolean) (getParams().containsKey("aggressive") ? getParams().get("aggressive")
|
||||
: DEFAULT_AGGRESSIVE);
|
||||
|
||||
res.add(new Person(s, aggressive).hash());
|
||||
|
||||
return res;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,20 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.Map;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
public class RandomClusteringFunction extends AbstractClusteringFunction {
|
||||
|
||||
public RandomClusteringFunction(Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(final Config conf, String s) {
|
||||
return null;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,34 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.Collections;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import com.google.common.base.Joiner;
|
||||
import com.google.common.base.Splitter;
|
||||
import com.google.common.collect.Lists;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("sortedngrampairs")
|
||||
public class SortedNgramPairs extends NgramPairs {
|
||||
|
||||
public SortedNgramPairs(Map<String, Object> params) {
|
||||
super(params, false);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(Config conf, String s) {
|
||||
|
||||
final List<String> tokens = Lists.newArrayList(Splitter.on(" ").omitEmptyStrings().trimResults().split(s));
|
||||
|
||||
Collections.sort(tokens);
|
||||
|
||||
return ngramPairs(
|
||||
Lists.newArrayList(getNgrams(Joiner.on(" ").join(tokens), param("ngramLen"), param("max") * 2, 1, 2)),
|
||||
param("max"));
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,34 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import org.apache.commons.lang3.RandomStringUtils;
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import com.google.common.collect.Lists;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("spacetrimmingfieldvalue")
|
||||
public class SpaceTrimmingFieldValue extends AbstractClusteringFunction {
|
||||
|
||||
public SpaceTrimmingFieldValue(final Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(final Config conf, final String s) {
|
||||
final List<String> res = Lists.newArrayList();
|
||||
|
||||
res
|
||||
.add(
|
||||
StringUtils.isBlank(s) ? RandomStringUtils.random(param("randomLength"))
|
||||
: s.toLowerCase().replaceAll("\\s+", ""));
|
||||
|
||||
return res;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,42 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("suffixprefix")
|
||||
public class SuffixPrefix extends AbstractClusteringFunction {
|
||||
|
||||
public SuffixPrefix(Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(Config conf, String s) {
|
||||
return suffixPrefix(s, param("len"), param("max"));
|
||||
}
|
||||
|
||||
private Collection<String> suffixPrefix(String s, int len, int max) {
|
||||
final Set<String> bigrams = Sets.newLinkedHashSet();
|
||||
int i = 0;
|
||||
while (++i < s.length() && bigrams.size() < max) {
|
||||
int j = s.indexOf(" ", i);
|
||||
|
||||
int offset = j + len + 1 < s.length() ? j + len + 1 : s.length();
|
||||
|
||||
if (j - len > 0) {
|
||||
String bigram = s.substring(j - len, offset).replaceAll(" ", "").trim();
|
||||
if (bigram.length() >= 4) {
|
||||
bigrams.add(bigram);
|
||||
}
|
||||
}
|
||||
}
|
||||
return bigrams;
|
||||
}
|
||||
|
||||
}
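
The `suffixPrefix` loop builds one key per word boundary: the last `len` characters before a space joined with the first `len` characters after it. A minimal sketch of the same loop, assuming a hypothetical class `SuffixPrefixSketch` and plain JDK collections in place of Guava:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical standalone sketch, not part of the commit.
final class SuffixPrefixSketch {

    static Set<String> suffixPrefix(String s, int len, int max) {
        final Set<String> keys = new LinkedHashSet<>();
        int i = 0;
        while (++i < s.length() && keys.size() < max) {
            final int j = s.indexOf(" ", i); // next word boundary, or -1
            final int end = Math.min(j + len + 1, s.length());
            if (j - len > 0) {
                final String key = s.substring(j - len, end).replaceAll(" ", "").trim();
                if (key.length() >= 4) {
                    keys.add(key);
                }
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // prints [orkfor, forded]
        System.out.println(suffixPrefix("framework for deduplication", 3, 5));
    }
}
```

Because the set keeps only distinct keys, revisiting the same boundary for successive values of `i` does not add duplicates.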
|
|
@ -0,0 +1,52 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.net.MalformedURLException;
|
||||
import java.net.URL;
|
||||
import java.util.Collection;
|
||||
import java.util.HashSet;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import eu.dnetlib.pace.common.AbstractPaceFunctions;
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("urlclustering")
|
||||
public class UrlClustering extends AbstractPaceFunctions implements ClusteringFunction {
|
||||
|
||||
protected Map<String, Object> params;
|
||||
|
||||
public UrlClustering(final Map<String, Object> params) {
|
||||
this.params = params;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> getParams() {
|
||||
return params;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Collection<String> apply(final Config conf, List<String> fields) {
|
||||
try {
|
||||
return fields
|
||||
.stream()
|
||||
.filter(f -> !f.isEmpty())
|
||||
.map(this::asUrl)
|
||||
.map(URL::getHost)
|
||||
.collect(Collectors.toCollection(HashSet::new));
|
||||
} catch (IllegalStateException e) {
|
||||
return new HashSet<>();
|
||||
}
|
||||
}
|
||||
|
||||
private URL asUrl(String value) {
|
||||
try {
|
||||
return new URL(value);
|
||||
} catch (MalformedURLException e) {
|
||||
// should not happen as checked by pace typing
|
||||
throw new IllegalStateException("invalid URL: " + value);
|
||||
}
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,91 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.*;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("wordsStatsSuffixPrefixChain")
|
||||
public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {
|
||||
|
||||
public WordsStatsSuffixPrefixChain(Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(Config conf, String s) {
|
||||
return suffixPrefixChain(s, param("mod"));
|
||||
}
|
||||
|
||||
private Collection<String> suffixPrefixChain(String s, int mod) {
|
||||
|
||||
// create the list of words from the string (remove short words)
|
||||
List<String> wordsList = Arrays
|
||||
.stream(s.split(" "))
|
||||
.filter(si -> si.length() > 3)
|
||||
.collect(Collectors.toList());
|
||||
|
||||
final int words = wordsList.size();
|
||||
final int letters = s.length();
|
||||
|
||||
// create the prefix: number of words + number of letters/mod
|
||||
String prefix = words + "-" + letters / mod + "-";
|
||||
|
||||
return doSuffixPrefixChain(wordsList, prefix);
|
||||
|
||||
}
|
||||
|
||||
private Collection<String> doSuffixPrefixChain(List<String> wordsList, String prefix) {
|
||||
|
||||
Set<String> set = Sets.newLinkedHashSet();
|
||||
switch (wordsList.size()) {
|
||||
case 0:
|
||||
case 1:
|
||||
break;
|
||||
case 2:
|
||||
set
|
||||
.add(
|
||||
prefix +
|
||||
suffix(wordsList.get(0), 3) +
|
||||
prefix(wordsList.get(1), 3));
|
||||
|
||||
set
|
||||
.add(
|
||||
prefix +
|
||||
prefix(wordsList.get(0), 3) +
|
||||
suffix(wordsList.get(1), 3));
|
||||
|
||||
break;
|
||||
default:
|
||||
set
|
||||
.add(
|
||||
prefix +
|
||||
suffix(wordsList.get(0), 3) +
|
||||
prefix(wordsList.get(1), 3) +
|
||||
suffix(wordsList.get(2), 3));
|
||||
|
||||
set
|
||||
.add(
|
||||
prefix +
|
||||
prefix(wordsList.get(0), 3) +
|
||||
suffix(wordsList.get(1), 3) +
|
||||
prefix(wordsList.get(2), 3));
|
||||
break;
|
||||
}
|
||||
|
||||
return set;
|
||||
|
||||
}
|
||||
|
||||
private String suffix(String s, int len) {
|
||||
return s.substring(s.length() - len);
|
||||
}
|
||||
|
||||
private String prefix(String s, int len) {
|
||||
return s.substring(0, len);
|
||||
}
|
||||
|
||||
}
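
The keys produced by `WordsStatsSuffixPrefixChain` combine a statistics prefix (`words-letters/mod-`) with alternating 3-character suffixes and prefixes of the first words. The following sketch reproduces that construction under stated assumptions: the class name `SuffixPrefixChainSketch` and the helpers `head` and `tail` are hypothetical, and JDK collections replace Guava.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical standalone sketch, not part of the commit.
final class SuffixPrefixChainSketch {

    static Set<String> keys(String s, int mod) {
        // keep only words longer than three characters, so 3-character slices are always safe
        final List<String> w = Arrays.stream(s.split(" "))
            .filter(x -> x.length() > 3)
            .collect(Collectors.toList());
        final String prefix = w.size() + "-" + s.length() / mod + "-";
        final Set<String> keys = new LinkedHashSet<>();
        if (w.size() == 2) {
            keys.add(prefix + tail(w.get(0)) + head(w.get(1)));
            keys.add(prefix + head(w.get(0)) + tail(w.get(1)));
        } else if (w.size() > 2) {
            keys.add(prefix + tail(w.get(0)) + head(w.get(1)) + tail(w.get(2)));
            keys.add(prefix + head(w.get(0)) + tail(w.get(1)) + head(w.get(2)));
        }
        return keys;
    }

    static String head(String s) {
        return s.substring(0, 3);
    }

    static String tail(String s) {
        return s.substring(s.length() - 3);
    }

    public static void main(String[] args) {
        // prints [3-1-penresaph, 3-1-operchgra]
        System.out.println(keys("open research graph", 10));
    }
}
```

Inputs with fewer than two long words produce no key, as in the `case 0` / `case 1` branch above.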
|
|
@ -0,0 +1,59 @@
|
|||
|
||||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
|
||||
@ClusteringClass("wordssuffixprefix")
|
||||
public class WordsSuffixPrefix extends AbstractClusteringFunction {
|
||||
|
||||
public WordsSuffixPrefix(Map<String, Object> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(Config conf, String s) {
|
||||
return suffixPrefix(s, param("len"), param("max"));
|
||||
}
|
||||
|
||||
private Collection<String> suffixPrefix(String s, int len, int max) {
|
||||
|
||||
final int words = s.split(" ").length;
|
||||
|
||||
// adjust the token length according to the number of words
|
||||
switch (words) {
|
||||
case 1:
|
||||
return Sets.newLinkedHashSet();
|
||||
case 2:
|
||||
return doSuffixPrefix(s, len + 2, max, words);
|
||||
case 3:
|
||||
return doSuffixPrefix(s, len + 1, max, words);
|
||||
default:
|
||||
return doSuffixPrefix(s, len, max, words);
|
||||
}
|
||||
}
|
||||
|
||||
private Collection<String> doSuffixPrefix(String s, int len, int max, int words) {
|
||||
final Set<String> bigrams = Sets.newLinkedHashSet();
|
||||
int i = 0;
|
||||
while (++i < s.length() && bigrams.size() < max) {
|
||||
int j = s.indexOf(" ", i);
|
||||
|
||||
int offset = j + len + 1 < s.length() ? j + len + 1 : s.length();
|
||||
|
||||
if (j - len > 0) {
|
||||
String bigram = s.substring(j - len, offset).replaceAll(" ", "").trim();
|
||||
if (bigram.length() >= 4) {
|
||||
bigrams.add(words + bigram);
|
||||
}
|
||||
}
|
||||
}
|
||||
return bigrams;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,359 @@
|
|||
|
||||
package eu.dnetlib.pace.common;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.io.StringWriter;
|
||||
import java.nio.charset.StandardCharsets;
|
||||
import java.text.Normalizer;
|
||||
import java.util.*;
|
||||
import java.util.regex.Matcher;
|
||||
import java.util.regex.Pattern;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import com.google.common.base.Joiner;
|
||||
import com.google.common.base.Splitter;
|
||||
import com.google.common.collect.Iterables;
|
||||
import com.google.common.collect.Sets;
|
||||
import com.ibm.icu.text.Transliterator;
|
||||
|
||||
import eu.dnetlib.pace.clustering.NGramUtils;
|
||||
|
||||
/**
|
||||
* Set of common functions for the framework
|
||||
*
|
||||
* @author claudio
|
||||
*/
|
||||
public class AbstractPaceFunctions {
|
||||
|
||||
// city map to be used when translating the city names into codes
|
||||
private static Map<String, String> cityMap = AbstractPaceFunctions
|
||||
.loadMapFromClasspath("/eu/dnetlib/pace/config/city_map.csv");
|
||||
|
||||
// list of stopwords in different languages
|
||||
protected static Set<String> stopwords_gr = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_gr.txt");
|
||||
protected static Set<String> stopwords_en = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_en.txt");
|
||||
protected static Set<String> stopwords_de = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_de.txt");
|
||||
protected static Set<String> stopwords_es = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_es.txt");
|
||||
protected static Set<String> stopwords_fr = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_fr.txt");
|
||||
protected static Set<String> stopwords_it = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_it.txt");
|
||||
protected static Set<String> stopwords_pt = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_pt.txt");
|
||||
|
||||
// transliterator
|
||||
protected static Transliterator transliterator = Transliterator.getInstance("Any-Eng");
|
||||
|
||||
// blacklist of ngrams: to avoid generic keys
|
||||
protected static Set<String> ngramBlacklist = loadFromClasspath("/eu/dnetlib/pace/config/ngram_blacklist.txt");
|
||||
|
||||
// html regex for normalization
|
||||
public static final Pattern HTML_REGEX = Pattern.compile("<[^>]*>");
|
||||
|
||||
private static final String alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ";
|
||||
private static final String aliases_from = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎àáâäæãåāèéêëēėęəîïíīįìôöòóœøōõûüùúūßśšłžźżçćčñń";
|
||||
private static final String aliases_to = "0123456789+-=()n0123456789+-=()aaaaaaaaeeeeeeeeiiiiiioooooooouuuuussslzzzcccnn";
|
||||
|
||||
// doi prefix for normalization
|
||||
public static final Pattern DOI_PREFIX = Pattern.compile("(https?:\\/\\/dx\\.doi\\.org\\/)|(doi:)");
|
||||
|
||||
private static Pattern numberPattern = Pattern.compile("-?\\d+(\\.\\d+)?");
|
||||
|
||||
private static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
|
||||
|
||||
private static Pattern romanNumberPattern = Pattern
|
||||
.compile("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$");
|
||||
|
||||
protected static String concat(final List<String> l) {
|
||||
return Joiner.on(" ").skipNulls().join(l);
|
||||
}
|
||||
|
||||
public static String cleanup(final String s) {
|
||||
final String s1 = HTML_REGEX.matcher(s).replaceAll("");
|
||||
final String s2 = unicodeNormalization(s1.toLowerCase());
|
||||
final String s3 = nfd(s2);
|
||||
final String s4 = fixXML(s3);
|
||||
final String s5 = s4.replaceAll("([0-9]+)", " $1 ");
|
||||
final String s6 = transliterate(s5);
|
||||
final String s7 = fixAliases(s6);
|
||||
final String s8 = s7.replaceAll("[^\\p{ASCII}]", "");
|
||||
final String s9 = s8.replaceAll("[\\p{Punct}]", " ");
|
||||
final String s10 = s9.replaceAll("\\n", " ");
|
||||
final String s11 = s10.replaceAll("(?m)\\s+", " ");
|
||||
final String s12 = s11.trim();
|
||||
return s12;
|
||||
}
|
||||
|
||||
protected static String fixXML(final String a) {
|
||||
|
||||
return a
|
||||
.replaceAll("&ndash;", " ")
|
||||
.replaceAll("&amp;", " ")
|
||||
.replaceAll("&quot;", " ")
|
||||
.replaceAll("&minus;", " ");
|
||||
}
|
||||
|
||||
protected static boolean checkNumbers(final String a, final String b) {
|
||||
final String numbersA = getNumbers(a);
|
||||
final String numbersB = getNumbers(b);
|
||||
final String romansA = getRomans(a);
|
||||
final String romansB = getRomans(b);
|
||||
return !numbersA.equals(numbersB) || !romansA.equals(romansB);
|
||||
}
|
||||
|
||||
protected static String getRomans(final String s) {
|
||||
final StringBuilder sb = new StringBuilder();
|
||||
for (final String t : s.split(" ")) {
|
||||
sb.append(isRoman(t) ? t : "");
|
||||
}
|
||||
return sb.toString();
|
||||
}
|
||||
|
||||
protected static boolean isRoman(final String s) {
|
||||
Matcher m = romanNumberPattern.matcher(s);
|
||||
return m.matches() && m.hitEnd();
|
||||
}
|
||||
|
||||
protected static String getNumbers(final String s) {
|
||||
final StringBuilder sb = new StringBuilder();
|
||||
for (final String t : s.split(" ")) {
|
||||
sb.append(isNumber(t) ? t : "");
|
||||
}
|
||||
return sb.toString();
|
||||
}
|
||||
|
||||
public static boolean isNumber(String strNum) {
|
||||
if (strNum == null) {
|
||||
return false;
|
||||
}
|
||||
return numberPattern.matcher(strNum).matches();
|
||||
}
|
||||
|
||||
protected static String fixAliases(final String s) {
|
||||
final StringBuilder sb = new StringBuilder();
|
||||
|
||||
s.chars().forEach(ch -> {
|
||||
final int i = StringUtils.indexOf(aliases_from, ch);
|
||||
sb.append(i >= 0 ? aliases_to.charAt(i) : (char) ch);
|
||||
});
|
||||
|
||||
return sb.toString();
|
||||
}
|
||||
|
||||
protected static String transliterate(final String s) {
|
||||
try {
|
||||
return transliterator.transliterate(s);
|
||||
} catch (Exception e) {
|
||||
return s;
|
||||
}
|
||||
}
|
||||
|
||||
protected static String removeSymbols(final String s) {
|
||||
final StringBuilder sb = new StringBuilder();
|
||||
|
||||
s.chars().forEach(ch -> {
|
||||
sb.append(StringUtils.contains(alpha, ch) ? (char) ch : ' ');
|
||||
});
|
||||
|
||||
return sb.toString().replaceAll("\\s+", " ");
|
||||
}
|
||||
|
||||
protected static boolean notNull(final String s) {
|
||||
return s != null;
|
||||
}
|
||||
|
||||
public static String normalize(final String s) {
|
||||
return fixAliases(transliterate(nfd(unicodeNormalization(s))))
|
||||
.toLowerCase()
|
||||
// do not merge these regexes into a single expression: that would cause a StackOverflowError on large input
|
||||
// strings
|
||||
.replaceAll("[^ \\w]+", "")
|
||||
.replaceAll("(\\p{InCombiningDiacriticalMarks})+", "")
|
||||
.replaceAll("(\\p{Punct})+", " ")
|
||||
.replaceAll("(\\d)+", " ")
|
||||
.replaceAll("(\\n)+", " ")
|
||||
.trim();
|
||||
}
|
||||
|
||||
public static String nfd(final String s) {
|
||||
return Normalizer.normalize(s, Normalizer.Form.NFD);
|
||||
}
|
||||
|
||||
public static String utf8(final String s) {
|
||||
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
|
||||
return new String(bytes, StandardCharsets.UTF_8);
|
||||
}
|
||||
|
||||
public static String unicodeNormalization(final String s) {
|
||||
|
||||
Matcher m = hexUnicodePattern.matcher(s);
|
||||
StringBuffer buf = new StringBuffer(s.length());
|
||||
while (m.find()) {
|
||||
String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
|
||||
m.appendReplacement(buf, Matcher.quoteReplacement(ch));
|
||||
}
|
||||
m.appendTail(buf);
|
||||
return buf.toString();
|
||||
}
|
||||
|
||||
protected static String filterStopWords(final String s, final Set<String> stopwords) {
|
||||
final StringTokenizer st = new StringTokenizer(s);
|
||||
final StringBuilder sb = new StringBuilder();
|
||||
while (st.hasMoreTokens()) {
|
||||
final String token = st.nextToken();
|
||||
if (!stopwords.contains(token)) {
|
||||
sb.append(token);
|
||||
sb.append(" ");
|
||||
}
|
||||
}
|
||||
return sb.toString().trim();
|
||||
}
|
||||
|
||||
public static String filterAllStopWords(String s) {
|
||||
|
||||
s = filterStopWords(s, stopwords_en);
|
||||
s = filterStopWords(s, stopwords_de);
|
||||
s = filterStopWords(s, stopwords_it);
|
||||
s = filterStopWords(s, stopwords_fr);
|
||||
s = filterStopWords(s, stopwords_pt);
|
||||
s = filterStopWords(s, stopwords_es);
|
||||
s = filterStopWords(s, stopwords_gr);
|
||||
|
||||
return s;
|
||||
}
|
||||
|
||||
protected static Collection<String> filterBlacklisted(final Collection<String> set,
|
||||
final Set<String> ngramBlacklist) {
|
||||
final Set<String> newset = Sets.newLinkedHashSet();
|
||||
for (final String s : set) {
|
||||
if (!ngramBlacklist.contains(s)) {
|
||||
newset.add(s);
|
||||
}
|
||||
}
|
||||
return newset;
|
||||
}
|
||||
|
||||
public static Set<String> loadFromClasspath(final String classpath) {
|
||||
|
||||
Transliterator transliterator = Transliterator.getInstance("Any-Eng");
|
||||
|
||||
final Set<String> h = Sets.newHashSet();
|
||||
try {
|
||||
for (final String s : IOUtils
|
||||
.readLines(NGramUtils.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
|
||||
h.add(fixAliases(transliterator.transliterate(s))); // transliteration of the stopwords
|
||||
}
|
||||
} catch (final Throwable e) {
|
||||
return Sets.newHashSet();
|
||||
}
|
||||
return h;
|
||||
}
|
||||
|
||||
public static Map<String, String> loadMapFromClasspath(final String classpath) {
|
||||
|
||||
Transliterator transliterator = Transliterator.getInstance("Any-Eng");
|
||||
|
||||
final Map<String, String> m = new HashMap<>();
|
||||
try {
|
||||
for (final String s : IOUtils
|
||||
.readLines(AbstractPaceFunctions.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
|
||||
// string is like this: code;word1;word2;word3
|
||||
String[] line = s.split(";");
|
||||
String value = line[0];
|
||||
for (int i = 1; i < line.length; i++) {
|
||||
m.put(fixAliases(transliterator.transliterate(line[i].toLowerCase())), value);
|
||||
}
|
||||
}
|
||||
} catch (final Throwable e) {
|
||||
return new HashMap<>();
|
||||
}
|
||||
return m;
|
||||
}
|
||||
|
||||
public static String removeKeywords(String s, Set<String> keywords) {
|
||||
|
||||
s = " " + s + " ";
|
||||
for (String k : keywords) {
|
||||
s = s.replaceAll(k.toLowerCase(), "");
|
||||
}
|
||||
|
||||
return s.trim();
|
||||
}
|
||||
|
||||
public static double commonElementsPercentage(Set<String> s1, Set<String> s2) {
|
||||
|
||||
double longer = Math.max(s1.size(), s2.size());
|
||||
return (double) s1.stream().filter(s2::contains).count() / longer;
|
||||
}
|
||||
|
||||
// convert the set of keywords to codes
|
||||
public static Set<String> toCodes(Set<String> keywords, Map<String, String> translationMap) {
|
||||
return keywords.stream().map(s -> translationMap.get(s)).collect(Collectors.toSet());
|
||||
}
|
||||
|
||||
public static Set<String> keywordsToCodes(Set<String> keywords, Map<String, String> translationMap) {
|
||||
return toCodes(keywords, translationMap);
|
||||
}
|
||||
|
||||
public static Set<String> citiesToCodes(Set<String> keywords) {
|
||||
return toCodes(keywords, cityMap);
|
||||
}
|
||||
|
||||
protected static String firstLC(final String s) {
|
||||
return StringUtils.substring(s, 0, 1).toLowerCase();
|
||||
}
|
||||
|
||||
protected static Iterable<String> tokens(final String s, final int maxTokens) {
|
||||
return Iterables.limit(Splitter.on(" ").omitEmptyStrings().trimResults().split(s), maxTokens);
|
||||
}
|
||||
|
||||
public static String normalizePid(String pid) {
|
||||
return DOI_PREFIX.matcher(pid.toLowerCase()).replaceAll("");
|
||||
}
|
||||
|
||||
// extract the keywords (keys of the translation map) that occur in the input string
|
||||
public static Set<String> getKeywords(String s1, Map<String, String> translationMap, int windowSize) {
|
||||
|
||||
String s = s1;
|
||||
|
||||
List<String> tokens = Arrays.asList(s.toLowerCase().split(" "));
|
||||
|
||||
Set<String> codes = new HashSet<>();
|
||||
|
||||
if (tokens.size() < windowSize)
|
||||
windowSize = tokens.size();
|
||||
|
||||
int length = windowSize;
|
||||
|
||||
while (length != 0) {
|
||||
|
||||
for (int i = 0; i <= tokens.size() - length; i++) {
|
||||
String candidate = concat(tokens.subList(i, i + length));
|
||||
if (translationMap.containsKey(candidate)) {
|
||||
codes.add(candidate);
|
||||
s = s.replace(candidate, "").trim();
|
||||
}
|
||||
}
|
||||
|
||||
tokens = Arrays.asList(s.split(" "));
|
||||
length -= 1;
|
||||
}
|
||||
|
||||
return codes;
|
||||
}
|
||||
|
||||
public static Set<String> getCities(String s1, int windowSize) {
|
||||
return getKeywords(s1, cityMap, windowSize);
|
||||
}
|
||||
|
||||
public static <T> String readFromClasspath(final String filename, final Class<T> clazz) {
|
||||
final StringWriter sw = new StringWriter();
|
||||
try {
|
||||
IOUtils.copy(clazz.getResourceAsStream(filename), sw, StandardCharsets.UTF_8);
|
||||
return sw.toString();
|
||||
} catch (final IOException e) {
|
||||
throw new RuntimeException("cannot load resource from classpath: " + filename);
|
||||
}
|
||||
}
|
||||
|
||||
}
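
The windowed lookup performed by `getKeywords` can be illustrated with a minimal, standard-library-only sketch. The class name `KeywordWindowSketch` and the method `findKeywords` are hypothetical and not part of this commit; unlike the original, the sketch lower-cases the working string up front and replaces a match with a space rather than an empty string, so it only approximates the method above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class, not part of the dnet-hadoop sources.
class KeywordWindowSketch {

    // Returns the dictionary keys found in the text, trying longer token windows first.
    static Set<String> findKeywords(String text, Map<String, String> dictionary, int windowSize) {
        String remaining = text.toLowerCase().trim();
        Set<String> found = new HashSet<>();
        List<String> tokens = Arrays.asList(remaining.split("\\s+"));
        int length = Math.min(windowSize, tokens.size());
        while (length > 0) {
            for (int i = 0; i + length <= tokens.size(); i++) {
                String candidate = String.join(" ", tokens.subList(i, i + length));
                if (dictionary.containsKey(candidate)) {
                    found.add(candidate);
                    // drop the match so that shorter windows cannot reuse its tokens
                    remaining = remaining.replace(candidate, " ").trim();
                }
            }
            // re-tokenize what is left and shrink the window by one token
            tokens = Arrays.asList(remaining.split("\\s+"));
            length--;
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("university", "key1");
        dictionary.put("technical university", "key2");
        System.out.println(findKeywords("Technical University of Munich", dictionary, 2));
    }
}
```

Because the two-token window is tried first, the longer key consumes its tokens and the shorter key `university` is not reported separately for this input.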
|
|
@ -0,0 +1,53 @@
|
|||
|
||||
package eu.dnetlib.pace.config;
|
||||
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.function.Predicate;
|
||||
|
||||
import eu.dnetlib.pace.model.ClusteringDef;
|
||||
import eu.dnetlib.pace.model.FieldDef;
|
||||
import eu.dnetlib.pace.tree.support.TreeNodeDef;
|
||||
|
||||
/**
|
||||
* Interface for PACE configuration bean.
|
||||
*
|
||||
* @author claudio
|
||||
*/
|
||||
public interface Config {
|
||||
|
||||
/**
|
||||
* Field configuration definitions.
|
||||
*
|
||||
* @return the list of definitions
|
||||
*/
|
||||
public List<FieldDef> model();
|
||||
|
||||
/**
|
||||
* Decision Tree definition
|
||||
*
|
||||
* @return the map representing the decision tree
|
||||
*/
|
||||
public Map<String, TreeNodeDef> decisionTree();
|
||||
|
||||
/**
|
||||
* Clusterings.
|
||||
*
|
||||
* @return the list
|
||||
*/
|
||||
public List<ClusteringDef> clusterings();
|
||||
|
||||
/**
|
||||
* Blacklists.
|
||||
*
|
||||
* @return the map
|
||||
*/
|
||||
public Map<String, Predicate<String>> blacklists();
|
||||
|
||||
/**
|
||||
* Translation map.
|
||||
*
|
||||
* @return the map
|
||||
*/
|
||||
public Map<String, String> translationMap();
|
||||
}
|
|
@ -0,0 +1,178 @@
|
|||
|
||||
package eu.dnetlib.pace.config;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.io.Serializable;
|
||||
import java.nio.charset.StandardCharsets;
|
||||
import java.util.AbstractMap;
|
||||
import java.util.HashMap;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.Map.Entry;
|
||||
import java.util.function.Predicate;
|
||||
import java.util.regex.Pattern;
|
||||
import java.util.regex.PatternSyntaxException;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import org.antlr.stringtemplate.StringTemplate;
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import com.fasterxml.jackson.annotation.JsonIgnore;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
import com.google.common.collect.Maps;
|
||||
|
||||
import eu.dnetlib.pace.model.ClusteringDef;
|
||||
import eu.dnetlib.pace.model.FieldDef;
|
||||
import eu.dnetlib.pace.tree.support.TreeNodeDef;
|
||||
import eu.dnetlib.pace.util.PaceException;
|
||||
|
||||
public class DedupConfig implements Config, Serializable {
|
||||
private static String CONFIG_TEMPLATE = "dedupConfig.st";
|
||||
|
||||
private PaceConfig pace;
|
||||
|
||||
private WfConfig wf;
|
||||
|
||||
@JsonIgnore
|
||||
private Map<String, Predicate<String>> blacklists;
|
||||
|
||||
private static Map<String, String> defaults = Maps.newHashMap();
|
||||
|
||||
static {
|
||||
defaults.put("dedupRun", "001");
|
||||
defaults.put("entityType", "result");
|
||||
defaults.put("subEntityType", "resulttype");
|
||||
defaults.put("subEntityValue", "publication");
|
||||
defaults.put("orderField", "title");
|
||||
defaults.put("queueMaxSize", "2000");
|
||||
defaults.put("groupMaxSize", "10");
|
||||
defaults.put("slidingWindowSize", "200");
|
||||
defaults.put("rootBuilder", "result");
|
||||
defaults.put("includeChildren", "true");
|
||||
defaults.put("maxIterations", "20");
|
||||
defaults.put("idPath", "$.id");
|
||||
}
|
||||
|
||||
public DedupConfig() {
|
||||
}
|
||||
|
||||
public static DedupConfig load(final String json) {
|
||||
|
||||
final DedupConfig config;
|
||||
try {
|
||||
config = new ObjectMapper().readValue(json, DedupConfig.class);
|
||||
config.getPace().initModel();
|
||||
config.getPace().initTranslationMap();
|
||||
|
||||
config.blacklists = config
|
||||
.getPace()
|
||||
.getBlacklists()
|
||||
.entrySet()
|
||||
.stream()
|
||||
.map(
|
||||
e -> new AbstractMap.SimpleEntry<String, List<Pattern>>(e.getKey(),
|
||||
e
|
||||
.getValue()
|
||||
.stream()
|
||||
.filter(s -> !StringUtils.isBlank(s))
|
||||
.map(Pattern::compile)
|
||||
.collect(Collectors.toList())))
|
||||
.collect(
|
||||
Collectors
|
||||
.toMap(
|
||||
e -> e.getKey(),
|
||||
e -> (Predicate<String> & Serializable) s -> e
|
||||
.getValue()
|
||||
.stream()
|
||||
.filter(p -> p.matcher(s).matches())
|
||||
.findFirst()
|
||||
.isPresent()))
|
||||
|
||||
;
|
||||
|
||||
return config;
|
||||
} catch (IOException | PatternSyntaxException e) {
|
||||
throw new PaceException("Error in parsing configuration json", e);
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
public static DedupConfig loadDefault() throws IOException {
|
||||
return loadDefault(new HashMap<String, String>());
|
||||
}
|
||||
|
||||
public static DedupConfig loadDefault(final Map<String, String> params) throws IOException {
|
||||
|
||||
final StringTemplate template = new StringTemplate(new DedupConfig().readFromClasspath(CONFIG_TEMPLATE));
|
||||
|
||||
for (final Entry<String, String> e : defaults.entrySet()) {
|
||||
template.setAttribute(e.getKey(), e.getValue());
|
||||
}
|
||||
for (final Entry<String, String> e : params.entrySet()) {
|
||||
if (template.getAttribute(e.getKey()) != null) {
|
||||
template.getAttributes().computeIfPresent(e.getKey(), (o, o2) -> e.getValue());
|
||||
} else {
|
||||
template.setAttribute(e.getKey(), e.getValue());
|
||||
}
|
||||
}
|
||||
|
||||
final String json = template.toString();
|
||||
return load(json);
|
||||
}
|
||||
|
||||
private String readFromClasspath(final String resource) throws IOException {
|
||||
return IOUtils.toString(getClass().getResource(resource), StandardCharsets.UTF_8);
|
||||
}
|
||||
|
||||
public PaceConfig getPace() {
|
||||
return pace;
|
||||
}
|
||||
|
||||
public void setPace(final PaceConfig pace) {
|
||||
this.pace = pace;
|
||||
}
|
||||
|
||||
public WfConfig getWf() {
|
||||
return wf;
|
||||
}
|
||||
|
||||
public void setWf(final WfConfig wf) {
|
||||
this.wf = wf;
|
||||
}
|
||||
|
||||
@Override
|
||||
public String toString() {
|
||||
try {
|
||||
return new ObjectMapper().writeValueAsString(this);
|
||||
} catch (IOException e) {
|
||||
throw new PaceException("unable to serialise configuration", e);
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, TreeNodeDef> decisionTree() {
|
||||
return getPace().getDecisionTree();
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<FieldDef> model() {
|
||||
return getPace().getModel();
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<ClusteringDef> clusterings() {
|
||||
return getPace().getClustering();
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Predicate<String>> blacklists() {
|
||||
return blacklists;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, String> translationMap() {
|
||||
return getPace().translationMap();
|
||||
}
|
||||
|
||||
}
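
The blacklist handling in `DedupConfig.load` (compile each non-blank regex once, then expose one predicate per field) can be sketched with the standard library alone. The names `BlacklistSketch` and `compile` are hypothetical; the sketch also leaves out the `Serializable` intersection cast that the class above applies to its predicates.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration class, not part of the dnet-hadoop sources.
class BlacklistSketch {

    // Compiles the non-blank regexes of each field and returns one predicate per field name.
    static Map<String, Predicate<String>> compile(Map<String, List<String>> raw) {
        Map<String, Predicate<String>> result = new HashMap<>();
        for (Map.Entry<String, List<String>> e : raw.entrySet()) {
            List<Pattern> patterns = e.getValue().stream()
                .filter(s -> s != null && !s.trim().isEmpty())
                .map(Pattern::compile)
                .collect(Collectors.toList());
            // a value is blacklisted when at least one pattern matches it entirely
            result.put(e.getKey(), value -> patterns.stream().anyMatch(p -> p.matcher(value).matches()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> raw = new HashMap<>();
        raw.put("title", Arrays.asList("^editorial$", "", "^(corrigendum|erratum).*"));
        Predicate<String> titleBlacklist = compile(raw).get("title");
        System.out.println(titleBlacklist.test("editorial"));
        System.out.println(titleBlacklist.test("a study of x"));
    }
}
```

Compiling the patterns once at load time keeps the per-record check down to a sequence of `Matcher.matches()` calls.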
|
|
@ -0,0 +1,108 @@
|
|||
|
||||
package eu.dnetlib.pace.config;
|
||||
|
||||
import java.io.Serializable;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import com.fasterxml.jackson.annotation.JsonIgnore;
|
||||
import com.google.common.collect.Maps;
|
||||
import com.ibm.icu.text.Transliterator;
|
||||
|
||||
import eu.dnetlib.pace.common.AbstractPaceFunctions;
|
||||
import eu.dnetlib.pace.model.ClusteringDef;
|
||||
import eu.dnetlib.pace.model.FieldDef;
|
||||
import eu.dnetlib.pace.tree.support.TreeNodeDef;
|
||||
import eu.dnetlib.pace.util.PaceResolver;
|
||||
|
||||
public class PaceConfig extends AbstractPaceFunctions implements Serializable {
|
||||
|
||||
private List<FieldDef> model;
|
||||
|
||||
private List<ClusteringDef> clustering;
|
||||
private Map<String, TreeNodeDef> decisionTree;
|
||||
|
||||
private Map<String, List<String>> blacklists;
|
||||
private Map<String, List<String>> synonyms;
|
||||
|
||||
@JsonIgnore
|
||||
private Map<String, String> translationMap;
|
||||
|
||||
public Map<String, FieldDef> getModelMap() {
|
||||
return modelMap;
|
||||
}
|
||||
|
||||
@JsonIgnore
|
||||
private Map<String, FieldDef> modelMap;
|
||||
|
||||
@JsonIgnore
|
||||
public static PaceResolver resolver = new PaceResolver();
|
||||
|
||||
public PaceConfig() {
|
||||
}
|
||||
|
||||
public void initModel() {
|
||||
modelMap = Maps.newHashMap();
|
||||
for (FieldDef fd : getModel()) {
|
||||
modelMap.put(fd.getName(), fd);
|
||||
}
|
||||
}
|
||||
|
||||
public void initTranslationMap() {
|
||||
translationMap = Maps.newHashMap();
|
||||
|
||||
Transliterator transliterator = Transliterator.getInstance("Any-Eng");
|
||||
for (String key : synonyms.keySet()) {
|
||||
for (String term : synonyms.get(key)) {
|
||||
translationMap
|
||||
.put(
|
||||
fixAliases(transliterator.transliterate(term.toLowerCase())),
|
||||
key);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
public Map<String, String> translationMap() {
|
||||
return translationMap;
|
||||
}
|
||||
|
||||
public List<FieldDef> getModel() {
|
||||
return model;
|
||||
}
|
||||
|
||||
public void setModel(final List<FieldDef> model) {
|
||||
this.model = model;
|
||||
}
|
||||
|
||||
public List<ClusteringDef> getClustering() {
|
||||
return clustering;
|
||||
}
|
||||
|
||||
public void setClustering(final List<ClusteringDef> clustering) {
|
||||
this.clustering = clustering;
|
||||
}
|
||||
|
||||
public Map<String, TreeNodeDef> getDecisionTree() {
|
||||
return decisionTree;
|
||||
}
|
||||
|
||||
public void setDecisionTree(Map<String, TreeNodeDef> decisionTree) {
|
||||
this.decisionTree = decisionTree;
|
||||
}
|
||||
|
||||
public Map<String, List<String>> getBlacklists() {
|
||||
return blacklists;
|
||||
}
|
||||
|
||||
public void setBlacklists(final Map<String, List<String>> blacklists) {
|
||||
this.blacklists = blacklists;
|
||||
}
|
||||
|
||||
public Map<String, List<String>> getSynonyms() {
|
||||
return synonyms;
|
||||
}
|
||||
|
||||
public void setSynonyms(Map<String, List<String>> synonyms) {
|
||||
this.synonyms = synonyms;
|
||||
}
|
||||
}
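
`initTranslationMap` inverts the `synonyms` map (code to terms) into a lookup from term to code. A minimal sketch of that inversion follows, under the assumption that lower-casing alone is enough; the ICU transliteration and `fixAliases` steps used above are deliberately omitted, and the names `TranslationMapSketch` and `invert` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class, not part of the dnet-hadoop sources.
class TranslationMapSketch {

    // Inverts code -> terms into term -> code, lower-casing each term.
    static Map<String, String> invert(Map<String, List<String>> synonyms) {
        Map<String, String> translation = new HashMap<>();
        for (Map.Entry<String, List<String>> e : synonyms.entrySet()) {
            for (String term : e.getValue()) {
                translation.put(term.toLowerCase(Locale.ROOT), e.getKey());
            }
        }
        return translation;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("key1", Arrays.asList("University", "Universita"));
        synonyms.put("key2", Arrays.asList("Institute"));
        System.out.println(invert(synonyms).get("university"));
    }
}
```

If the same term appears under two codes, the later `put` wins, which mirrors the behavior of a plain map insertion.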
|
|
@ -0,0 +1,6 @@
|
|||
|
||||
package eu.dnetlib.pace.config;
|
||||
|
||||
public enum Type {
|
||||
String, Int, List, JSON, URL, StringConcat, DoubleArray
|
||||
}
|
|
@ -0,0 +1,294 @@
|
|||
|
||||
package eu.dnetlib.pace.config;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.io.Serializable;
|
||||
import java.util.HashSet;
|
||||
import java.util.List;
|
||||
import java.util.Set;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
import com.google.common.collect.Lists;
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
import eu.dnetlib.pace.util.PaceException;
|
||||
|
||||
public class WfConfig implements Serializable {
|
||||
|
||||
/**
|
||||
* Entity type.
|
||||
*/
|
||||
private String entityType = "";
|
||||
|
||||
/**
|
||||
* Sub-Entity type refers to one of the fields declared in the model. See eu.dnetlib.pace.config.PaceConfig.modelMap
|
||||
*/
|
||||
private String subEntityType = "";
|
||||
|
||||
/**
|
||||
* Sub-Entity value declares a value for subTypes to be considered.
|
||||
*/
|
||||
private String subEntityValue = "";
|
||||
|
||||
/**
|
||||
* Field name used to sort the values in the reducer phase.
|
||||
*/
|
||||
private String orderField = "";
|
||||
|
||||
/**
|
||||
* Column Families involved in the relations redirection.
|
||||
*/
|
||||
private List<String> rootBuilder = Lists.newArrayList();
|
||||
|
||||
/**
|
||||
* Set of datasource namespace prefixes that won't be deduplicated.
|
||||
*/
|
||||
private Set<String> skipList = Sets.newHashSet();
|
||||
|
||||
/**
|
||||
* Subprefix used to build the root id, allows multiple dedup runs.
|
||||
*/
|
||||
private String dedupRun = "";
|
||||
|
||||
/**
|
||||
* Similarity threshold.
|
||||
*/
|
||||
private double threshold = 0;
|
||||
|
||||
/** The queue max size. */
|
||||
private int queueMaxSize = 2000;
|
||||
|
||||
/** The group max size. */
|
||||
private int groupMaxSize;
|
||||
|
||||
/** The sliding window size. */
|
||||
private int slidingWindowSize;
|
||||
|
||||
/** The configuration id. */
|
||||
private String configurationId;
|
||||
|
||||
/** Whether the children are included in the representative records. */
|
||||
private boolean includeChildren;
|
||||
|
||||
/** Default maximum number of allowed children. */
|
||||
private final static int MAX_CHILDREN = 10;
|
||||
|
||||
/** Maximum number of allowed children. */
|
||||
private int maxChildren = MAX_CHILDREN;
|
||||
|
||||
/** Default maximum number of iterations. */
|
||||
private final static int MAX_ITERATIONS = 20;
|
||||
|
||||
/** Maximum number of iterations */
|
||||
private int maxIterations = MAX_ITERATIONS;
|
||||
|
||||
/** The JSONPath expression used to retrieve the identifier */
|
||||
private String idPath = "$.id";
|
||||
|
||||
public WfConfig() {
|
||||
}
|
||||
|
||||
/**
|
||||
* Instantiates a new workflow configuration.
|
||||
*
|
||||
* @param entityType
|
||||
* the entity type
|
||||
* @param orderField
|
||||
* the order field
|
||||
* @param rootBuilder
|
||||
* the root builder families
|
||||
* @param dedupRun
|
||||
* the dedup run
|
||||
* @param skipList
|
||||
* the skip list
|
||||
* @param queueMaxSize
|
||||
* the queue max size
|
||||
* @param groupMaxSize
|
||||
* the group max size
|
||||
* @param slidingWindowSize
|
||||
* the sliding window size
|
||||
* @param includeChildren
|
||||
* allows the children to be included in the representative records or not.
|
||||
* @param maxIterations
|
||||
* the maximum number of iterations
|
||||
* @param idPath
|
||||
* the path for the id of the entity
|
||||
*/
|
||||
public WfConfig(final String entityType, final String orderField, final List<String> rootBuilder,
|
||||
final String dedupRun,
|
||||
final Set<String> skipList, final int queueMaxSize, final int groupMaxSize, final int slidingWindowSize,
|
||||
final boolean includeChildren, final int maxIterations, final String idPath) {
|
||||
super();
|
||||
this.entityType = entityType;
|
||||
this.orderField = orderField;
|
||||
this.rootBuilder = rootBuilder;
|
||||
this.dedupRun = cleanupStringNumber(dedupRun);
|
||||
this.skipList = skipList;
|
||||
this.queueMaxSize = queueMaxSize;
|
||||
this.groupMaxSize = groupMaxSize;
|
||||
this.slidingWindowSize = slidingWindowSize;
|
||||
this.includeChildren = includeChildren;
|
||||
this.maxIterations = maxIterations;
|
||||
this.idPath = idPath;
|
||||
}
|
||||
|
||||
/**
|
||||
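* Removes single quotes from a string-encoded number, e.g. a quoted dedup run identifier.
|
||||
*
|
||||
* @param s
|
||||
* the string to clean up
|
||||
* @return the input string without single quotes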
|
||||
*/
|
||||
private String cleanupStringNumber(final String s) {
|
||||
return s.contains("'") ? s.replaceAll("'", "") : s;
|
||||
}
|
||||
|
||||
public boolean hasSubType() {
|
||||
return StringUtils.isNotBlank(getSubEntityType()) && StringUtils.isNotBlank(getSubEntityValue());
|
||||
}
|
||||
|
||||
public String getEntityType() {
|
||||
return entityType;
|
||||
}
|
||||
|
||||
public void setEntityType(final String entityType) {
|
||||
this.entityType = entityType;
|
||||
}
|
||||
|
||||
public String getSubEntityType() {
|
||||
return subEntityType;
|
||||
}
|
||||
|
||||
public void setSubEntityType(final String subEntityType) {
|
||||
this.subEntityType = subEntityType;
|
||||
}
|
||||
|
||||
public String getSubEntityValue() {
|
||||
return subEntityValue;
|
||||
}
|
||||
|
||||
public void setSubEntityValue(final String subEntityValue) {
|
||||
this.subEntityValue = subEntityValue;
|
||||
}
|
||||
|
||||
public String getOrderField() {
|
||||
return orderField;
|
||||
}
|
||||
|
||||
public void setOrderField(final String orderField) {
|
||||
this.orderField = orderField;
|
||||
}
|
||||
|
||||
public List<String> getRootBuilder() {
|
||||
return rootBuilder;
|
||||
}
|
||||
|
||||
public void setRootBuilder(final List<String> rootBuilder) {
|
||||
this.rootBuilder = rootBuilder;
|
||||
}
|
||||
|
||||
public Set<String> getSkipList() {
|
||||
return skipList != null ? skipList : new HashSet<String>();
|
||||
}
|
||||
|
||||
public void setSkipList(final Set<String> skipList) {
|
||||
this.skipList = skipList;
|
||||
}
|
||||
|
||||
public String getDedupRun() {
|
||||
return dedupRun;
|
||||
}
|
||||
|
||||
public void setDedupRun(final String dedupRun) {
|
||||
this.dedupRun = dedupRun;
|
||||
}
|
||||
|
||||
public double getThreshold() {
|
||||
return threshold;
|
||||
}
|
||||
|
||||
public void setThreshold(final double threshold) {
|
||||
this.threshold = threshold;
|
||||
}
|
||||
|
||||
public int getQueueMaxSize() {
|
||||
return queueMaxSize;
|
||||
}
|
||||
|
||||
public void setQueueMaxSize(final int queueMaxSize) {
|
||||
this.queueMaxSize = queueMaxSize;
|
||||
}
|
||||
|
||||
public int getGroupMaxSize() {
|
||||
return groupMaxSize;
|
||||
}
|
||||
|
||||
public void setGroupMaxSize(final int groupMaxSize) {
|
||||
this.groupMaxSize = groupMaxSize;
|
||||
}
|
||||
|
||||
public int getSlidingWindowSize() {
|
||||
return slidingWindowSize;
|
||||
}
|
||||
|
||||
public void setSlidingWindowSize(final int slidingWindowSize) {
|
||||
this.slidingWindowSize = slidingWindowSize;
|
||||
}
|
||||
|
||||
public String getConfigurationId() {
|
||||
return configurationId;
|
||||
}
|
||||
|
||||
public void setConfigurationId(final String configurationId) {
|
||||
this.configurationId = configurationId;
|
||||
}
|
||||
|
||||
public boolean isIncludeChildren() {
|
||||
return includeChildren;
|
||||
}
|
||||
|
||||
public void setIncludeChildren(final boolean includeChildren) {
|
||||
this.includeChildren = includeChildren;
|
||||
}
|
||||
|
||||
public int getMaxChildren() {
|
||||
return maxChildren;
|
||||
}
|
||||
|
||||
public void setMaxChildren(final int maxChildren) {
|
||||
this.maxChildren = maxChildren;
|
||||
}
|
||||
|
||||
public int getMaxIterations() {
|
||||
return maxIterations;
|
||||
}
|
||||
|
||||
public void setMaxIterations(int maxIterations) {
|
||||
this.maxIterations = maxIterations;
|
||||
}
|
||||
|
||||
public String getIdPath() {
|
||||
return idPath;
|
||||
}
|
||||
|
||||
public void setIdPath(String idPath) {
|
||||
this.idPath = idPath;
|
||||
|
||||
}
|
||||
|
||||
/*
|
||||
* (non-Javadoc)
|
||||
* @see java.lang.Object#toString()
|
||||
*/
|
||||
@Override
|
||||
public String toString() {
|
||||
try {
|
||||
return new ObjectMapper().writeValueAsString(this);
|
||||
} catch (IOException e) {
|
||||
throw new PaceException("unable to serialise " + this.getClass().getName(), e);
|
||||
}
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,63 @@
|
|||
|
||||
package eu.dnetlib.pace.model;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.io.Serializable;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
|
||||
import eu.dnetlib.pace.clustering.ClusteringFunction;
|
||||
import eu.dnetlib.pace.config.PaceConfig;
|
||||
import eu.dnetlib.pace.util.PaceException;
|
||||
|
||||
public class ClusteringDef implements Serializable {
|
||||
|
||||
private String name;
|
||||
|
||||
private List<String> fields;
|
||||
|
||||
private Map<String, Object> params;
|
||||
|
||||
public ClusteringDef() {
|
||||
}
|
||||
|
||||
public String getName() {
|
||||
return name;
|
||||
}
|
||||
|
||||
public void setName(final String name) {
|
||||
this.name = name;
|
||||
}
|
||||
|
||||
public ClusteringFunction clusteringFunction() {
|
||||
return PaceConfig.resolver.getClusteringFunction(getName(), params);
|
||||
}
|
||||
|
||||
public List<String> getFields() {
|
||||
return fields;
|
||||
}
|
||||
|
||||
public void setFields(final List<String> fields) {
|
||||
this.fields = fields;
|
||||
}
|
||||
|
||||
public Map<String, Object> getParams() {
|
||||
return params;
|
||||
}
|
||||
|
||||
public void setParams(final Map<String, Object> params) {
|
||||
this.params = params;
|
||||
}
|
||||
|
||||
@Override
|
||||
public String toString() {
|
||||
try {
|
||||
return new ObjectMapper().writeValueAsString(this);
|
||||
} catch (IOException e) {
|
||||
throw new PaceException("unable to serialise " + this.getClass().getName(), e);
|
||||
}
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,138 @@
|
|||
|
||||
package eu.dnetlib.pace.model;
|
||||
|
||||
import java.io.Serializable;
|
||||
import java.util.HashSet;
|
||||
import java.util.List;
|
||||
|
||||
import com.fasterxml.jackson.core.JsonProcessingException;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
import com.google.common.base.Splitter;
|
||||
import com.google.common.collect.Lists;
|
||||
|
||||
import eu.dnetlib.pace.config.Type;
|
||||
|
||||
/**
|
||||
* The schema is composed of field definitions (FieldDef). Each field has a type, a name, and an associated comparison algorithm.
|
||||
*/
|
||||
public class FieldDef implements Serializable {
|
||||
|
||||
public final static String PATH_SEPARATOR = "/";
|
||||
|
||||
private String name;
|
||||
|
||||
private String path;
|
||||
|
||||
private Type type;
|
||||
|
||||
private boolean overrideMatch;
|
||||
|
||||
/**
|
||||
* Sets maximum size for the repeatable fields in the model. -1 for unbounded size.
|
||||
*/
|
||||
private int size = -1;
|
||||
|
||||
/**
|
||||
* Sets maximum length for field values in the model. -1 for unbounded length.
|
||||
*/
|
||||
private int length = -1;
|
||||
|
||||
private HashSet<String> filter;
|
||||
|
||||
private boolean sorted;
|
||||
|
||||
public boolean isSorted() {
|
||||
return sorted;
|
||||
}
|
||||
|
||||
private String clean;
|
||||
|
||||
public FieldDef() {
|
||||
}
|
||||
|
||||
public String getName() {
|
||||
return name;
|
||||
}
|
||||
|
||||
public String getPath() {
|
||||
return path;
|
||||
}
|
||||
|
||||
public List<String> getPathList() {
|
||||
return Lists.newArrayList(Splitter.on(PATH_SEPARATOR).split(getPath()));
|
||||
}
|
||||
|
||||
public Type getType() {
|
||||
return type;
|
||||
}
|
||||
|
||||
public void setType(final Type type) {
|
||||
this.type = type;
|
||||
}
|
||||
|
||||
public boolean isOverrideMatch() {
|
||||
return overrideMatch;
|
||||
}
|
||||
|
||||
public void setOverrideMatch(final boolean overrideMatch) {
|
||||
this.overrideMatch = overrideMatch;
|
||||
}
|
||||
|
||||
public int getSize() {
|
||||
return size;
|
||||
}
|
||||
|
||||
public void setSize(int size) {
|
||||
this.size = size;
|
||||
}
|
||||
|
||||
public int getLength() {
|
||||
return length;
|
||||
}
|
||||
|
||||
public void setLength(int length) {
|
||||
this.length = length;
|
||||
}
|
||||
|
||||
public void setName(String name) {
|
||||
this.name = name;
|
||||
}
|
||||
|
||||
public void setPath(String path) {
|
||||
this.path = path;
|
||||
}
|
||||
|
||||
public HashSet<String> getFilter() {
|
||||
return filter;
|
||||
}
|
||||
|
||||
public void setFilter(HashSet<String> filter) {
|
||||
this.filter = filter;
|
||||
}
|
||||
|
||||
public boolean getSorted() {
|
||||
return sorted;
|
||||
}
|
||||
|
||||
public void setSorted(boolean sorted) {
|
||||
this.sorted = sorted;
|
||||
}
|
||||
|
||||
public String getClean() {
|
||||
return clean;
|
||||
}
|
||||
|
||||
public void setClean(String clean) {
|
||||
this.clean = clean;
|
||||
}
|
||||
|
||||
@Override
|
||||
public String toString() {
|
||||
try {
|
||||
return new ObjectMapper().writeValueAsString(this);
|
||||
} catch (JsonProcessingException e) {
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
}
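
The `size` and `length` fields of `FieldDef` use -1 to mean "unbounded". How these limits are enforced is not shown in this file, so the following is only a sketch of one plausible interpretation: keep at most `size` values and truncate each value to `length` characters. The names `FieldLimitsSketch` and `applyLimits` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration class, not part of the dnet-hadoop sources.
class FieldLimitsSketch {

    // Keeps at most `size` values and at most `length` characters per value; a negative limit is ignored.
    static List<String> applyLimits(List<String> values, int size, int length) {
        return values.stream()
            .limit(size >= 0 ? size : Long.MAX_VALUE)
            .map(v -> length >= 0 && v.length() > length ? v.substring(0, length) : v)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("alpha", "beta", "gamma");
        System.out.println(applyLimits(values, 2, 3));
        System.out.println(applyLimits(values, -1, -1));
    }
}
```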
|
|
@ -0,0 +1,156 @@
|
|||
|
||||
package eu.dnetlib.pace.model;
|
||||
|
||||
import java.nio.charset.Charset;
|
||||
import java.text.Normalizer;
|
||||
import java.util.List;
|
||||
import java.util.Set;
|
||||
|
||||
import com.google.common.base.Joiner;
|
||||
import com.google.common.base.Splitter;
|
||||
import com.google.common.collect.Iterables;
|
||||
import com.google.common.collect.Lists;
|
||||
import com.google.common.hash.Hashing;
|
||||
|
||||
import eu.dnetlib.pace.common.AbstractPaceFunctions;
|
||||
import eu.dnetlib.pace.util.Capitalise;
|
||||
import eu.dnetlib.pace.util.DotAbbreviations;
|
||||
|
||||
public class Person {
|
||||
|
||||
private static final String UTF8 = "UTF-8";
|
||||
private List<String> name = Lists.newArrayList();
|
||||
private List<String> surname = Lists.newArrayList();
|
||||
private List<String> fullname = Lists.newArrayList();
|
||||
private final String original;
|
||||
|
||||
private static Set<String> particles = null;
|
||||
|
||||
public Person(String s, final boolean aggressive) {
|
||||
original = s;
|
||||
s = Normalizer.normalize(s, Normalizer.Form.NFD);
|
||||
s = s.replaceAll("\\(.+\\)", "");
|
||||
s = s.replaceAll("\\[.+\\]", "");
|
||||
s = s.replaceAll("\\{.+\\}", "");
|
||||
s = s.replaceAll("\\s+-\\s+", "-");
|
||||
s = s.replaceAll("[\\p{Punct}&&[^,-]]", " ");
|
||||
s = s.replaceAll("\\d", " ");
|
||||
s = s.replaceAll("\\n", " ");
|
||||
s = s.replaceAll("\\.", " ");
|
||||
s = s.replaceAll("\\s+", " ");
|
||||
|
||||
if (aggressive) {
|
||||
s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}&&[^,-]]", "");
|
||||
// s = s.replaceAll("[\\W&&[^,-]]", "");
|
||||
}
|
||||
|
||||
if (s.contains(",")) { // a comma makes the split easy: surname before it, name after it
|
||||
final String[] arr = s.split(",");
|
||||
if (arr.length == 1) {
|
||||
fullname = splitTerms(arr[0]);
|
||||
} else if (arr.length > 1) {
|
||||
surname = splitTerms(arr[0]);
|
||||
name = splitTerms(arr[1]);
|
||||
fullname.addAll(surname);
|
||||
fullname.addAll(name);
|
||||
}
|
||||
} else {
|
||||
fullname = splitTerms(s);
|
||||
|
||||
int lastInitialPosition = fullname.size();
|
||||
boolean hasSurnameInUpperCase = false;
|
||||
|
||||
for (int i = 0; i < fullname.size(); i++) {
|
||||
final String term = fullname.get(i);
|
||||
if (term.length() == 1) {
|
||||
lastInitialPosition = i;
|
||||
} else if (term.equals(term.toUpperCase())) {
|
||||
hasSurnameInUpperCase = true;
|
||||
}
|
||||
}
|
||||
|
||||
if (lastInitialPosition < (fullname.size() - 1)) { // Case: Michele G. Artini
|
||||
name = fullname.subList(0, lastInitialPosition + 1);
|
||||
surname = fullname.subList(lastInitialPosition + 1, fullname.size());
|
||||
} else if (hasSurnameInUpperCase) { // Case: Michele ARTINI
|
||||
for (final String term : fullname) {
|
||||
if ((term.length() > 1) && term.equals(term.toUpperCase())) {
|
||||
surname.add(term);
|
||||
} else {
|
||||
name.add(term);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private List<String> splitTerms(final String s) {
|
||||
if (particles == null) {
|
||||
particles = AbstractPaceFunctions.loadFromClasspath("/eu/dnetlib/pace/config/name_particles.txt");
|
||||
}
|
||||
|
||||
final List<String> list = Lists.newArrayList();
|
||||
for (final String part : Splitter.on(" ").omitEmptyStrings().split(s)) {
|
||||
if (!particles.contains(part.toLowerCase())) {
|
||||
list.add(part);
|
||||
}
|
||||
}
|
||||
return list;
|
||||
}
|
||||
|
||||
public List<String> getName() {
|
||||
return name;
|
||||
}
|
||||
|
||||
public String getNameString() {
|
||||
return Joiner.on(" ").join(getName());
|
||||
}
|
||||
|
||||
public List<String> getSurname() {
|
||||
return surname;
|
||||
}
|
||||
|
||||
public List<String> getFullname() {
|
||||
return fullname;
|
||||
}
|
||||
|
||||
public String getOriginal() {
|
||||
return original;
|
||||
}
|
||||
|
||||
public String hash() {
|
||||
return Hashing.murmur3_128().hashString(getNormalisedFullname(), Charset.forName(UTF8)).toString();
|
||||
}
|
||||
|
||||
public String getNormalisedFirstName() {
|
||||
return Joiner.on(" ").join(getCapitalFirstnames());
|
||||
}
|
||||
|
||||
public String getNormalisedSurname() {
|
||||
return Joiner.on(" ").join(getCapitalSurname());
|
||||
}
|
||||
|
||||
public String getSurnameString() {
|
||||
return Joiner.on(" ").join(getSurname());
|
||||
}
|
||||
|
||||
public String getNormalisedFullname() {
|
||||
return isAccurate() ? getNormalisedSurname() + ", " + getNormalisedFirstName() : Joiner.on(" ").join(fullname);
|
||||
}
|
||||
|
||||
public List<String> getCapitalFirstnames() {
|
||||
return Lists.newArrayList(Iterables.transform(getNameWithAbbreviations(), new Capitalise()));
|
||||
}
|
||||
|
||||
public List<String> getCapitalSurname() {
|
||||
return Lists.newArrayList(Iterables.transform(surname, new Capitalise()));
|
||||
}
|
||||
|
||||
public List<String> getNameWithAbbreviations() {
|
||||
return Lists.newArrayList(Iterables.transform(name, new DotAbbreviations()));
|
||||
}
|
||||
|
||||
public boolean isAccurate() {
|
||||
return ((name != null) && (surname != null) && !name.isEmpty() && !surname.isEmpty());
|
||||
}
|
||||
}
|
|
@ -0,0 +1,119 @@
|
|||
|
||||
package eu.dnetlib.pace.model;
|
||||
|
||||
import java.util.ArrayList;
|
||||
import java.util.Collections;
|
||||
import java.util.List;
|
||||
import java.util.Set;
|
||||
|
||||
import com.google.common.collect.Lists;
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
public class PersonComparatorUtils {
|
||||
|
||||
private static final int MAX_FULLNAME_LENGTH = 50;
|
||||
|
||||
public static Set<String> getNgramsForPerson(String fullname) {
|
||||
|
||||
Set<String> set = Sets.newHashSet();
|
||||
|
||||
if (fullname.length() > MAX_FULLNAME_LENGTH) {
|
||||
return set;
|
||||
}
|
||||
|
||||
Person p = new Person(fullname, true);
|
||||
|
||||
if (p.isAccurate()) {
|
||||
for (String name : p.getName()) {
|
||||
for (String surname : p.getSurname()) {
|
||||
set.add((name.charAt(0) + "_" + surname).toLowerCase());
|
||||
}
|
||||
}
|
||||
} else {
|
||||
List<String> list = p.getFullname();
|
||||
for (int i = 0; i < list.size(); i++) {
|
||||
if (list.get(i).length() > 1) {
|
||||
for (int j = 0; j < list.size(); j++) {
|
||||
if (i != j) {
|
||||
set.add((list.get(j).charAt(0) + "_" + list.get(i)).toLowerCase());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return set;
|
||||
}
|
||||
|
||||
public static boolean areSimilar(String s1, String s2) {
|
||||
Person p1 = new Person(s1, true);
|
||||
Person p2 = new Person(s2, true);
|
||||
|
||||
if (p1.isAccurate() && p2.isAccurate()) {
|
||||
return verifyNames(p1.getName(), p2.getName()) && verifySurnames(p1.getSurname(), p2.getSurname());
|
||||
} else {
|
||||
return verifyFullnames(p1.getFullname(), p2.getFullname());
|
||||
}
|
||||
}
|
||||
|
||||
private static boolean verifyNames(List<String> list1, List<String> list2) {
|
||||
return verifySimilarity(extractExtendedNames(list1), extractExtendedNames(list2))
|
||||
&& verifySimilarity(extractInitials(list1), extractInitials(list2));
|
||||
}
|
||||
|
||||
private static boolean verifySurnames(List<String> list1, List<String> list2) {
|
||||
if (list1.size() != list2.size()) {
|
||||
return false;
|
||||
}
|
||||
for (int i = 0; i < list1.size(); i++) {
|
||||
if (!list1.get(i).equalsIgnoreCase(list2.get(i))) {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
private static boolean verifyFullnames(List<String> list1, List<String> list2) {
|
||||
Collections.sort(list1);
|
||||
Collections.sort(list2);
|
||||
return verifySimilarity(extractExtendedNames(list1), extractExtendedNames(list2))
|
||||
&& verifySimilarity(extractInitials(list1), extractInitials(list2));
|
||||
}
|
||||
|
||||
private static List<String> extractExtendedNames(List<String> list) {
|
||||
ArrayList<String> res = Lists.newArrayList();
|
||||
for (String s : list) {
|
||||
if (s.length() > 1) {
|
||||
res.add(s.toLowerCase());
|
||||
}
|
||||
}
|
||||
return res;
|
||||
}
|
||||
|
||||
private static List<String> extractInitials(List<String> list) {
|
||||
ArrayList<String> res = Lists.newArrayList();
|
||||
for (String s : list) {
|
||||
res.add(s.substring(0, 1).toLowerCase());
|
||||
}
|
||||
return res;
|
||||
}
|
||||
|
||||
private static boolean verifySimilarity(List<String> list1, List<String> list2) {
|
||||
if (list1.size() > list2.size()) {
|
||||
return verifySimilarity(list2, list1);
|
||||
}
|
||||
|
||||
// NB: list2 is at least as long as list1
|
||||
int pos = -1;
|
||||
for (String s : list1) {
|
||||
int curr = list2.indexOf(s);
|
||||
if (curr > pos) {
|
||||
list2.set(curr, "*"); // invalidate the matched element so a duplicate (e.g. "amm - amm") needs its own match
|
||||
pos = curr;
|
||||
} else {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
return true;
|
||||
}
|
||||
}
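Note on `verifySimilarity` above: the check is an ordered containment test, in which every element of the shorter list must be found in the longer list at a strictly increasing position. The following is a minimal standalone sketch of that idea using only `java.util`. The class name `OrderedMatchSketch` and the method `inOrder` are hypothetical and not part of this commit, and unlike the original the sketch copies the longer list instead of overwriting the caller's elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OrderedMatchSketch {

    // Hypothetical helper: true when every element of the shorter list occurs
    // in the longer list at a strictly increasing position.
    static boolean inOrder(List<String> first, List<String> second) {
        boolean firstIsShorter = first.size() <= second.size();
        List<String> shorter = firstIsShorter ? first : second;
        // Work on a copy so the caller's list is left untouched.
        List<String> longer = new ArrayList<>(firstIsShorter ? second : first);

        int pos = -1;
        for (String s : shorter) {
            int curr = longer.indexOf(s);
            if (curr <= pos) {
                return false; // not found (-1), or found before the previous match
            }
            longer.set(curr, "*"); // mark as used so a duplicate needs a second occurrence
            pos = curr;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(inOrder(Arrays.asList("a", "c"), Arrays.asList("a", "b", "c")));
        System.out.println(inOrder(Arrays.asList("c", "a"), Arrays.asList("a", "b", "c")));
    }
}
```

Copying the longer list is a design choice of the sketch: the method in the commit writes `"*"` into the list it receives, which is harmless there only because the lists are freshly built by `extractExtendedNames` and `extractInitials`.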
|
|
@ -0,0 +1,65 @@
|
|||
|
||||
package eu.dnetlib.pace.model;
|
||||
|
||||
import java.util.Comparator;
|
||||
|
||||
import org.apache.spark.sql.Row;
|
||||
|
||||
import eu.dnetlib.pace.clustering.NGramUtils;
|
||||
|
||||
/**
|
||||
* The Class RowDataOrderingComparator.
|
||||
*/
|
||||
public class RowDataOrderingComparator implements Comparator<Row> {
|
||||
|
||||
/** The comparator field. */
|
||||
private final int comparatorField;
|
||||
private final int identityFieldPosition;
|
||||
|
||||
/**
|
||||
* Instantiates a new row data ordering comparator.
|
||||
*
|
||||
* @param comparatorField
|
||||
* the comparator field
|
||||
*/
|
||||
public RowDataOrderingComparator(final int comparatorField, int identityFieldPosition) {
|
||||
this.comparatorField = comparatorField;
|
||||
this.identityFieldPosition = identityFieldPosition;
|
||||
}
|
||||
|
||||
/*
|
||||
* (non-Javadoc)
|
||||
* @see java.util.Comparator#compare(java.lang.Object, java.lang.Object)
|
||||
*/
|
||||
@Override
|
||||
public int compare(final Row d1, final Row d2) {
|
||||
if (d1 == null)
|
||||
return d2 == null ? 0 : -1;
|
||||
else if (d2 == null) {
|
||||
return 1;
|
||||
}
|
||||
|
||||
final String o1 = d1.getString(comparatorField);
|
||||
final String o2 = d2.getString(comparatorField);
|
||||
|
||||
if (o1 == null)
|
||||
return o2 == null ? 0 : -1;
|
||||
else if (o2 == null) {
|
||||
return 1;
|
||||
}
|
||||
|
||||
final String to1 = NGramUtils.cleanupForOrdering(o1);
|
||||
final String to2 = NGramUtils.cleanupForOrdering(o2);
|
||||
|
||||
int res = to1.compareTo(to2);
|
||||
if (res == 0) {
|
||||
res = o1.compareTo(o2);
|
||||
if (res == 0) {
|
||||
return d1.getString(identityFieldPosition).compareTo(d2.getString(identityFieldPosition));
|
||||
}
|
||||
}
|
||||
|
||||
return res;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,157 @@
|
|||
package eu.dnetlib.pace.model
|
||||
|
||||
import eu.dnetlib.pace.config.{DedupConfig, Type}
|
||||
import eu.dnetlib.pace.util.{BlockProcessor, SparkReporter}
|
||||
import org.apache.spark.SparkContext
|
||||
import org.apache.spark.sql.catalyst.expressions.Literal
|
||||
import org.apache.spark.sql.expressions._
|
||||
import org.apache.spark.sql.functions.{col, desc, expr, lit, udf}
|
||||
import org.apache.spark.sql.types._
|
||||
import org.apache.spark.sql.{Column, Dataset, Row, SaveMode, functions}
|
||||
|
||||
import java.util.function.Predicate
|
||||
import java.util.stream.Collectors
|
||||
import scala.collection.JavaConversions._
|
||||
import scala.collection.JavaConverters._
|
||||
import scala.collection.mutable
|
||||
case class SparkDeduper(conf: DedupConfig) extends Serializable {
|
||||
|
||||
val model: SparkModel = SparkModel(conf)
|
||||
|
||||
val dedup: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
df.transform(filterAndCleanup)
|
||||
.transform(generateClustersWithCollect)
|
||||
.transform(processBlocks)
|
||||
}
|
||||
|
||||
|
||||
val filterAndCleanup: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
|
||||
if (conf.blacklists.containsKey(fdef.getName)) {
|
||||
res.withColumn(
|
||||
fdef.getName + "_filtered",
|
||||
filterColumnUDF(fdef).apply(new Column(fdef.getName))
|
||||
)
|
||||
} else {
|
||||
res
|
||||
}
|
||||
})
|
||||
|
||||
df_with_filters
|
||||
}
|
||||
|
||||
def filterColumnUDF(fdef: FieldDef): UserDefinedFunction = {
|
||||
val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
|
||||
|
||||
if (blacklist == null) {
|
||||
throw new IllegalArgumentException("Column: " + fdef.getName + " does not have any filter")
|
||||
} else {
|
||||
fdef.getType match {
|
||||
case Type.List | Type.JSON =>
|
||||
udf[Array[String], Array[String]](values => {
|
||||
values.filter((v: String) => !blacklist.test(v))
|
||||
})
|
||||
|
||||
case _ =>
|
||||
udf[String, String](v => {
|
||||
if (blacklist.test(v)) ""
|
||||
else v
|
||||
})
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
val generateClustersWithCollect: (Dataset[Row] => Dataset[Row]) = df_with_filters => {
|
||||
var df_with_clustering_keys: Dataset[Row] = null
|
||||
|
||||
for ((cd, idx) <- conf.clusterings().zipWithIndex) {
|
||||
val inputColumns = cd.getFields().foldLeft(Seq[Column]())((acc, fName) => {
|
||||
val column = if (conf.blacklists.containsKey(fName))
|
||||
Seq(col(fName + "_filtered"))
|
||||
else
|
||||
Seq(col(fName))
|
||||
|
||||
acc ++ column
|
||||
})
|
||||
|
||||
// Add 'key' column with the value generated by the given clustering definition
|
||||
val ds: Dataset[Row] = df_with_filters
|
||||
.withColumn("clustering", lit(cd.getName + "::" + idx))
|
||||
.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(inputColumns: _*))))
|
||||
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
|
||||
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName))))
|
||||
// .withColumn("count", functions.max("position").over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName)).rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing) ))
|
||||
// .filter("count > 1")
|
||||
|
||||
if (df_with_clustering_keys == null)
|
||||
df_with_clustering_keys = ds
|
||||
else
|
||||
df_with_clustering_keys = df_with_clustering_keys.union(ds)
|
||||
}
|
||||
|
||||
//TODO: analytics
|
||||
/*df_with_clustering_keys.groupBy(col("clustering"), col("key"))
|
||||
.agg(expr("max(count) AS size"))
|
||||
.orderBy(desc("size"))
|
||||
.show*/
|
||||
|
||||
val df_with_blocks = df_with_clustering_keys
|
||||
// split each clustering block into smaller blocks bounded by queueMaxSize
|
||||
.groupBy(col("clustering"), col("key"), functions.floor(col("position").divide(lit(conf.getWf.getQueueMaxSize))))
|
||||
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
|
||||
.filter(functions.size(new Column("block")).gt(1))
|
||||
.union(
|
||||
//adjacency blocks
|
||||
df_with_clustering_keys
|
||||
// filter out leading and trailing elements
|
||||
.filter(col("position").gt(conf.getWf.getSlidingWindowSize/2))
|
||||
//.filter(col("position").lt(col("count").minus(conf.getWf.getSlidingWindowSize/2)))
|
||||
// create small blocks of records on "the border" of maxqueuesize: getSlidingWindowSize/2 elements before and after
|
||||
.filter(
|
||||
col("position").mod(conf.getWf.getQueueMaxSize).lt(conf.getWf.getSlidingWindowSize/2) // slice of the start of block
|
||||
|| col("position").mod(conf.getWf.getQueueMaxSize).gt(conf.getWf.getQueueMaxSize - (conf.getWf.getSlidingWindowSize/2)) //slice of the end of the block
|
||||
)
|
||||
.groupBy(col("clustering"), col("key"), functions.floor((col("position") + lit(conf.getWf.getSlidingWindowSize/2)).divide(lit(conf.getWf.getQueueMaxSize))))
|
||||
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
|
||||
.filter(functions.size(new Column("block")).gt(1))
|
||||
)
|
||||
|
||||
df_with_blocks
|
||||
}
|
||||
|
||||
def clusterValuesUDF(cd: ClusteringDef) = {
|
||||
udf[mutable.WrappedArray[String], mutable.WrappedArray[Any]](values => {
|
||||
val valueList = values.flatMap {
|
||||
case a: mutable.WrappedArray[Any] => a.map(_.toString)
|
||||
case s: Any => Seq(s.toString)
|
||||
}.asJava
|
||||
|
||||
mutable.WrappedArray.make(cd.clusteringFunction().apply(conf, valueList).toArray())
|
||||
|
||||
})
|
||||
}
|
||||
|
||||
val processBlocks: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
df.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
|
||||
.withColumn("relations", processBlock(df.sqlContext.sparkContext).apply(new Column("block")))
|
||||
.select(functions.explode(new Column("relations")).as("relation"))
|
||||
}
|
||||
|
||||
def processBlock(implicit sc: SparkContext) = {
|
||||
val accumulators = SparkReporter.constructAccumulator(conf, sc)
|
||||
|
||||
udf[Array[(String, String)], mutable.WrappedArray[Row]](block => {
|
||||
val reporter = new SparkReporter(accumulators)
|
||||
|
||||
val mapDocuments = block.asJava.stream()
|
||||
.sorted(new RowDataOrderingComparator(model.orderingFieldPosition, model.identityFieldPosition))
|
||||
.limit(conf.getWf.getQueueMaxSize)
|
||||
.collect(Collectors.toList[Row]())
|
||||
|
||||
new BlockProcessor(conf, model.identityFieldPosition, model.orderingFieldPosition).processSortedRows(mapDocuments, reporter)
|
||||
|
||||
reporter.getRelations.asScala.toArray
|
||||
}).asNondeterministic()
|
||||
}
|
||||
|
||||
}
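Note on the block splitting in `generateClustersWithCollect` above: rows sharing a clustering key are grouped by `floor(position / queueMaxSize)`, where `position` comes from `row_number()` and therefore starts at 1. The arithmetic can be sketched in plain Java as below. `BlockIndexSketch` and `blockIndex` are hypothetical names used only for illustration, and the sketch covers the main split only, not the adjacency blocks built around the borders.

```java
final class BlockIndexSketch {

    // Hypothetical helper: index of the sub-block a row falls into.
    // For non-negative values, integer division equals floor(position / queueMaxSize).
    static long blockIndex(long position, long queueMaxSize) {
        if (position < 1 || queueMaxSize < 1) {
            throw new IllegalArgumentException("position and queueMaxSize must be >= 1");
        }
        return position / queueMaxSize;
    }

    public static void main(String[] args) {
        long queueMaxSize = 4; // illustrative value, not taken from any real configuration
        for (long position = 1; position <= 8; position++) {
            System.out.println("position " + position + " -> block " + blockIndex(position, queueMaxSize));
        }
    }
}
```

Because positions are 1-based, the first sub-block of each key holds positions 1 to `queueMaxSize - 1`, one row fewer than the following ones.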
|
|
@ -0,0 +1,150 @@
|
|||
package eu.dnetlib.pace.model
|
||||
|
||||
import com.jayway.jsonpath.{Configuration, JsonPath}
|
||||
import eu.dnetlib.pace.common.AbstractPaceFunctions
|
||||
import eu.dnetlib.pace.config.{DedupConfig, Type}
|
||||
import eu.dnetlib.pace.util.MapDocumentUtil
|
||||
import org.apache.commons.lang3.StringUtils
|
||||
import org.apache.spark.sql.catalyst.encoders.RowEncoder
|
||||
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
|
||||
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
|
||||
import org.apache.spark.sql.{Dataset, Row}
|
||||
|
||||
import java.util.Locale
|
||||
import java.util.regex.Pattern
|
||||
import scala.collection.JavaConverters._
|
||||
|
||||
case class SparkModel(conf: DedupConfig) {
|
||||
private val URL_REGEX: Pattern = Pattern.compile("^\\s*(http|https|ftp)\\://.*")
|
||||
|
||||
private val CONCAT_REGEX: Pattern = Pattern.compile("\\|\\|\\|")
|
||||
|
||||
val identifierFieldName = "identifier"
|
||||
|
||||
val orderingFieldName = if (!conf.getWf.getOrderField.isEmpty) conf.getWf.getOrderField else identifierFieldName
|
||||
|
||||
val schema: StructType = {
|
||||
// create an implicit identifier field
|
||||
val identifier = new FieldDef()
|
||||
identifier.setName(identifierFieldName)
|
||||
identifier.setType(Type.String)
|
||||
|
||||
// Construct a Spark StructType representing the schema of the model
|
||||
(Seq(identifier) ++ conf.getPace.getModel.asScala)
|
||||
.foldLeft(
|
||||
new StructType()
|
||||
)((resType, fieldDef) => {
|
||||
resType.add(fieldDef.getType match {
|
||||
case Type.List | Type.JSON =>
|
||||
StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
|
||||
case Type.DoubleArray =>
|
||||
StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
|
||||
case _ =>
|
||||
StructField(fieldDef.getName, DataTypes.StringType, true, Metadata.empty)
|
||||
})
|
||||
})
|
||||
|
||||
|
||||
}
|
||||
|
||||
val identityFieldPosition: Int = schema.fieldIndex(identifierFieldName)
|
||||
|
||||
val orderingFieldPosition: Int = schema.fieldIndex(orderingFieldName)
|
||||
|
||||
val parseJsonDataset: (Dataset[String] => Dataset[Row]) = df => {
|
||||
df.map(r => rowFromJson(r))(RowEncoder(schema))
|
||||
}
|
||||
|
||||
def rowFromJson(json: String): Row = {
|
||||
val documentContext =
|
||||
JsonPath.using(Configuration.defaultConfiguration.addOptions(com.jayway.jsonpath.Option.SUPPRESS_EXCEPTIONS)).parse(json)
|
||||
val values = new Array[Any](schema.size)
|
||||
|
||||
values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
|
||||
|
||||
schema.fieldNames.zipWithIndex.foldLeft(values) {
|
||||
case ((res, (fname, index))) =>
|
||||
val fdef = conf.getPace.getModelMap.get(fname)
|
||||
|
||||
if (fdef != null) {
|
||||
res(index) = fdef.getType match {
|
||||
case Type.String | Type.Int =>
|
||||
MapDocumentUtil.truncateValue(
|
||||
MapDocumentUtil.getJPathString(fdef.getPath, documentContext),
|
||||
fdef.getLength
|
||||
)
|
||||
|
||||
case Type.URL =>
|
||||
var uv = MapDocumentUtil.getJPathString(fdef.getPath, documentContext)
|
||||
if (!URL_REGEX.matcher(uv).matches)
|
||||
uv = ""
|
||||
uv
|
||||
|
||||
case Type.List | Type.JSON =>
|
||||
MapDocumentUtil.truncateList(
|
||||
MapDocumentUtil.getJPathList(fdef.getPath, documentContext, fdef.getType),
|
||||
fdef.getSize
|
||||
).asScala
|
||||
|
||||
case Type.StringConcat =>
|
||||
val jpaths = CONCAT_REGEX.split(fdef.getPath)
|
||||
|
||||
MapDocumentUtil.truncateValue(
|
||||
jpaths
|
||||
.map(jpath => MapDocumentUtil.getJPathString(jpath, documentContext))
|
||||
.mkString(" "),
|
||||
fdef.getLength
|
||||
)
|
||||
|
||||
case Type.DoubleArray =>
|
||||
MapDocumentUtil.getJPathArray(fdef.getPath, json)
|
||||
}
|
||||
|
||||
val filter = fdef.getFilter
|
||||
|
||||
if (StringUtils.isNotBlank(fdef.getClean)) {
|
||||
res(index) = res(index) match {
|
||||
case x: Seq[String] => x.map(clean(_, fdef.getClean)).toSeq
|
||||
case _ => clean(res(index).toString, fdef.getClean)
|
||||
}
|
||||
}
|
||||
|
||||
if (filter != null && !filter.isEmpty) {
|
||||
res(index) = res(index) match {
|
||||
case x: String if filter.contains(x.toLowerCase(Locale.ROOT)) => null
|
||||
case x: Seq[String] => x.filter(s => !filter.contains(s.toLowerCase(Locale.ROOT))).toSeq
|
||||
case _ => res(index)
|
||||
}
|
||||
}
|
||||
|
||||
if (fdef.getSorted) {
|
||||
res(index) = res(index) match {
|
||||
case x: Seq[String] => x.sorted.toSeq
|
||||
case _ => res(index)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
res
|
||||
}
|
||||
|
||||
new GenericRowWithSchema(values, schema)
|
||||
}
|
||||
|
||||
def clean(value: String, cleantype: String) : String = {
|
||||
val res = cleantype match {
|
||||
case "title" => AbstractPaceFunctions.cleanup(value)
|
||||
case _ => value
|
||||
}
|
||||
|
||||
// if (!res.equals(AbstractPaceFunctions.normalize(value))) {
|
||||
// println(res)
|
||||
// println(AbstractPaceFunctions.normalize(value))
|
||||
// println()
|
||||
// }
|
||||
|
||||
res
|
||||
}
|
||||
|
||||
}
|
||||
|
|
@ -0,0 +1,42 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("alwaysMatch")
|
||||
public class AlwaysMatch<T> extends AbstractComparator<T> {
|
||||
|
||||
public AlwaysMatch(final Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.JaroWinkler());
|
||||
}
|
||||
|
||||
public AlwaysMatch(final double weight) {
|
||||
super(weight, new com.wcohen.ss.JaroWinkler());
|
||||
}
|
||||
|
||||
protected AlwaysMatch(final double weight, final AbstractStringDistance ssalgo) {
|
||||
super(weight, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double compare(final Object a, final Object b, final Config conf) {
|
||||
return 1.0;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(final double d) {
|
||||
return d;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,164 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.model.Person;
|
||||
import eu.dnetlib.pace.tree.support.AbstractListComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("authorsMatch")
|
||||
public class AuthorsMatch extends AbstractListComparator {
|
||||
|
||||
Map<String, String> params;
|
||||
|
||||
private double SURNAME_THRESHOLD;
|
||||
private double NAME_THRESHOLD;
|
||||
private double FULLNAME_THRESHOLD;
|
||||
private String MODE; // full or surname
|
||||
private int SIZE_THRESHOLD;
|
||||
private String TYPE; // count or percentage
|
||||
|
||||
public AuthorsMatch(Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.JaroWinkler());
|
||||
this.params = params;
|
||||
|
||||
MODE = params.getOrDefault("mode", "full");
|
||||
SURNAME_THRESHOLD = Double.parseDouble(params.getOrDefault("surname_th", "0.95"));
|
||||
NAME_THRESHOLD = Double.parseDouble(params.getOrDefault("name_th", "0.95"));
|
||||
FULLNAME_THRESHOLD = Double.parseDouble(params.getOrDefault("fullname_th", "0.9"));
|
||||
SIZE_THRESHOLD = Integer.parseInt(params.getOrDefault("size_th", "20"));
|
||||
TYPE = params.getOrDefault("type", "percentage");
|
||||
}
|
||||
|
||||
protected AuthorsMatch(double w, AbstractStringDistance ssalgo) {
|
||||
super(w, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double compare(final List<String> a, final List<String> b, final Config conf) {
|
||||
if (a.isEmpty() || b.isEmpty())
|
||||
return -1;
|
||||
|
||||
if (a.size() > SIZE_THRESHOLD || b.size() > SIZE_THRESHOLD)
|
||||
return 1.0;
|
||||
|
||||
int maxMiss = Integer.MAX_VALUE;
|
||||
List<Person> bList = b.stream().map(author -> new Person(author, false)).collect(Collectors.toList());
|
||||
|
||||
Double threshold = getDoubleParam("threshold");
|
||||
|
||||
if (threshold != null && threshold >= 0.0 && threshold <= 1.0 && a.size() == b.size()) {
|
||||
maxMiss = (int) Math.floor((1 - threshold) * Math.max(a.size(), b.size()));
|
||||
}
|
||||
|
||||
int common = 0;
|
||||
// compare each author in a with each author in b
|
||||
for (int i = 0; i < a.size(); i++) {
|
||||
Person p1 = new Person(a.get(i), false);
|
||||
|
||||
for (Person p2 : bList) {
|
||||
// both persons are inaccurate
|
||||
if (!p1.isAccurate() && !p2.isAccurate()) {
|
||||
// compare just normalized fullnames
|
||||
String fullname1 = normalization(
|
||||
p1.getNormalisedFullname().isEmpty() ? p1.getOriginal() : p1.getNormalisedFullname());
|
||||
String fullname2 = normalization(
|
||||
p2.getNormalisedFullname().isEmpty() ? p2.getOriginal() : p2.getNormalisedFullname());
|
||||
|
||||
if (ssalgo.score(fullname1, fullname2) > FULLNAME_THRESHOLD) {
|
||||
common += 1;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// one person is inaccurate
|
||||
if (p1.isAccurate() ^ p2.isAccurate()) {
|
||||
// prepare data
|
||||
// data for the accurate person
|
||||
String name = normalization(
|
||||
p1.isAccurate() ? p1.getNormalisedFirstName() : p2.getNormalisedFirstName());
|
||||
String surname = normalization(
|
||||
p1.isAccurate() ? p1.getNormalisedSurname() : p2.getNormalisedSurname());
|
||||
|
||||
// data for the inaccurate person
|
||||
String fullname = normalization(
|
||||
p1.isAccurate()
|
||||
? ((p2.getNormalisedFullname().isEmpty()) ? p2.getOriginal() : p2.getNormalisedFullname())
|
||||
: (p1.getNormalisedFullname().isEmpty() ? p1.getOriginal() : p1.getNormalisedFullname()));
|
||||
|
||||
if (fullname.contains(surname)) {
|
||||
if (MODE.equals("full")) {
|
||||
if (fullname.contains(name)) {
|
||||
common += 1;
|
||||
break;
|
||||
}
|
||||
} else { // MODE equals "surname"
|
||||
common += 1;
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// both persons are accurate
|
||||
if (p1.isAccurate() && p2.isAccurate()) {
|
||||
|
||||
if (compareSurname(p1, p2)) {
|
||||
if (MODE.equals("full")) {
|
||||
if (compareFirstname(p1, p2)) {
|
||||
common += 1;
|
||||
break;
|
||||
}
|
||||
} else { // MODE equals "surname"
|
||||
common += 1;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
if (i - common > maxMiss) {
|
||||
return 0.0;
|
||||
}
|
||||
}
|
||||
|
||||
// normalization factor to compute the score
|
||||
int normFactor = a.size() == b.size() ? a.size() : (a.size() + b.size() - common);
|
||||
|
||||
if (TYPE.equals("percentage")) {
|
||||
return (double) common / normFactor;
|
||||
} else {
|
||||
return (double) common;
|
||||
}
|
||||
}
|
||||
|
||||
public boolean compareSurname(Person p1, Person p2) {
|
||||
return ssalgo
|
||||
.score(
|
||||
normalization(p1.getNormalisedSurname()), normalization(p2.getNormalisedSurname())) > SURNAME_THRESHOLD;
|
||||
}
|
||||
|
||||
public boolean compareFirstname(Person p1, Person p2) {
|
||||
|
||||
if (p1.getNormalisedFirstName().length() <= 2 || p2.getNormalisedFirstName().length() <= 2) {
|
||||
if (firstLC(p1.getNormalisedFirstName()).equals(firstLC(p2.getNormalisedFirstName())))
|
||||
return true;
|
||||
}
|
||||
|
||||
return ssalgo
|
||||
.score(
|
||||
normalization(p1.getNormalisedFirstName()),
|
||||
normalization(p2.getNormalisedFirstName())) > NAME_THRESHOLD;
|
||||
}
|
||||
|
||||
public String normalization(String s) {
|
||||
return normalize(utf8(cleanup(s)));
|
||||
}
|
||||
|
||||
}
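Note on the final score in `AuthorsMatch.compare` above: with lists of equal size the number of matched authors is divided by that size, otherwise by `a.size() + b.size() - common`, which is the size of the union in a Jaccard-style ratio. A minimal sketch of just that step follows; `AuthorsScoreSketch` and `score` are hypothetical names, and the matching loop that produces `common` is deliberately left out.

```java
final class AuthorsScoreSketch {

    // Hypothetical helper mirroring the last lines of compare():
    // percentage mode returns common / normFactor, count mode returns common.
    static double score(int sizeA, int sizeB, int common, boolean percentage) {
        int normFactor = sizeA == sizeB ? sizeA : (sizeA + sizeB - common);
        return percentage ? (double) common / normFactor : (double) common;
    }

    public static void main(String[] args) {
        System.out.println(score(3, 3, 3, true));  // all authors matched, equal sizes
        System.out.println(score(2, 4, 2, true));  // both authors of the shorter list matched
        System.out.println(score(3, 3, 2, false)); // count mode
    }
}
```

The sketch assumes both sizes are positive, which holds in the original because empty lists return -1 before this point is reached.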
|
|
@ -0,0 +1,48 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("cityMatch")
|
||||
public class CityMatch extends AbstractStringComparator {
|
||||
|
||||
private Map<String, String> params;
|
||||
|
||||
public CityMatch(Map<String, String> params) {
|
||||
super(params);
|
||||
this.params = params;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double distance(final String a, final String b, final Config conf) {
|
||||
|
||||
String ca = cleanup(a);
|
||||
String cb = cleanup(b);
|
||||
|
||||
ca = normalize(ca);
|
||||
cb = normalize(cb);
|
||||
|
||||
ca = filterAllStopWords(ca);
|
||||
cb = filterAllStopWords(cb);
|
||||
|
||||
Set<String> cities1 = getCities(ca, Integer.parseInt(params.getOrDefault("windowSize", "4")));
|
||||
Set<String> cities2 = getCities(cb, Integer.parseInt(params.getOrDefault("windowSize", "4")));
|
||||
|
||||
Set<String> codes1 = citiesToCodes(cities1);
|
||||
Set<String> codes2 = citiesToCodes(cities2);
|
||||
|
||||
// if no cities are detected, the comparator gives 1.0
|
||||
if (codes1.isEmpty() && codes2.isEmpty())
|
||||
return 1.0;
|
||||
else {
|
||||
if (codes1.isEmpty() ^ codes2.isEmpty())
|
||||
return -1; // undefined if one of the two has no cities
|
||||
return commonElementsPercentage(codes1, codes2);
|
||||
}
|
||||
}
|
||||
}
|
|
@ -0,0 +1,47 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("cosineSimilarity")
|
||||
public class CosineSimilarity extends AbstractComparator<double[]> {
|
||||
|
||||
Map<String, String> params;
|
||||
|
||||
public CosineSimilarity(Map<String, String> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double compare(Object a, Object b, Config config) {
|
||||
return compare((double[]) a, (double[]) b, config);
|
||||
}
|
||||
|
||||
public double compare(final double[] a, final double[] b, final Config conf) {
|
||||
|
||||
if (a.length == 0 || b.length == 0)
|
||||
return -1;
|
||||
|
||||
return cosineSimilarity(a, b);
|
||||
}
|
||||
|
||||
double cosineSimilarity(double[] a, double[] b) {
|
||||
double dotProduct = 0;
|
||||
double normASum = 0;
|
||||
double normBSum = 0;
|
||||
|
||||
for (int i = 0; i < a.length; i++) {
|
||||
dotProduct += a[i] * b[i];
|
||||
normASum += a[i] * a[i];
|
||||
normBSum += b[i] * b[i];
|
||||
}
|
||||
|
||||
double normProduct = Math.sqrt(normASum) * Math.sqrt(normBSum);
|
||||
return dotProduct / normProduct;
|
||||
}
|
||||
|
||||
}
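The cosine computation above divides the dot product by the product of the two Euclidean norms. As written it indexes `b` with positions taken from `a`, so a shorter `b` would throw, and an all-zero vector gives 0/0 (NaN in Java). A minimal sketch with both cases guarded follows; the class name `CosineSketch` and the choice to return -1 for these inputs are assumptions, picked to match the "undefined" value the comparator already returns for empty vectors.

```java
// Hypothetical, self-contained variant of the cosine computation with input guards.
class CosineSketch {

    static double cosine(double[] a, double[] b) {
        // Assumption: empty or mismatched vectors are treated as "undefined" (-1).
        if (a.length == 0 || a.length != b.length)
            return -1;

        double dotProduct = 0;
        double normASum = 0;
        double normBSum = 0;
        for (int i = 0; i < a.length; i++) {
            dotProduct += a[i] * b[i];
            normASum += a[i] * a[i];
            normBSum += b[i] * b[i];
        }

        double normProduct = Math.sqrt(normASum) * Math.sqrt(normBSum);
        // Assumption: a zero vector has no direction, so the result is "undefined" as well.
        if (normProduct == 0)
            return -1;

        return dotProduct / normProduct;
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] { 1, 2 }, new double[] { 2, 4 }));
        System.out.println(cosine(new double[] { 1, 0 }, new double[] { 0, 1 }));
    }
}
```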
|
|
@ -0,0 +1,27 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
/**
|
||||
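* The Class DoiExactMatch: a case-insensitive exact match on DOIs that ignores the "http://dx.doi.org/" and "doi:" prefixes.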
|
||||
*
|
||||
* @author claudio
|
||||
*/
|
||||
@ComparatorClass("doiExactMatch")
|
||||
public class DoiExactMatch extends ExactMatchIgnoreCase {
|
||||
|
||||
public final String PREFIX = "(http:\\/\\/dx\\.doi\\.org\\/)|(doi:)";
|
||||
|
||||
public DoiExactMatch(final Map<String, String> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected String toString(final Object f) {
|
||||
return super.toString(f).replaceAll(PREFIX, "");
|
||||
}
|
||||
|
||||
}
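The effect of `PREFIX` can be checked outside the comparator. The sketch below reuses the same regular expression and assumes, as the class hierarchy suggests, that the stripped strings are then compared ignoring case; the class and method names are hypothetical. Note that the pattern as written only covers `http://dx.doi.org/` and `doi:`, so an `https://doi.org/` resolver URL passes through unchanged.

```java
// Hypothetical sketch of the DOI prefix stripping; the regex is copied from the class above.
class DoiPrefixSketch {

    static final String PREFIX = "(http:\\/\\/dx\\.doi\\.org\\/)|(doi:)";

    // Removes the known resolver/scheme prefixes, leaving the bare DOI.
    static String strip(String doi) {
        return doi.replaceAll(PREFIX, "");
    }

    // Assumption: after stripping, DOIs are compared case-insensitively.
    static boolean sameDoi(String a, String b) {
        return strip(a).equalsIgnoreCase(strip(b));
    }

    public static void main(String[] args) {
        System.out.println(strip("http://dx.doi.org/10.1000/xyz"));
        System.out.println(sameDoi("doi:10.1000/XYZ", "http://dx.doi.org/10.1000/xyz"));
    }
}
```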
|
|
@ -0,0 +1,30 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.net.MalformedURLException;
|
||||
import java.net.URL;
|
||||
import java.util.Map;
|
||||
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("domainExactMatch")
|
||||
public class DomainExactMatch extends ExactMatchIgnoreCase {
|
||||
|
||||
public DomainExactMatch(final Map<String, String> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected String toString(final Object f) {
|
||||
|
||||
try {
|
||||
return asUrl(super.toString(f)).getHost();
|
||||
} catch (MalformedURLException e) {
|
||||
return "";
|
||||
}
|
||||
}
|
||||
|
||||
private URL asUrl(final String value) throws MalformedURLException {
|
||||
return new URL(value);
|
||||
}
|
||||
}
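The host extraction above relies on `java.net.URL`. A minimal sketch of the same idea follows; the class name `DomainSketch` and the sample URLs are illustrative only. A value that cannot be parsed becomes the empty string, which the parent comparator's `compare` then reports as a missing field (-1) rather than as a mismatch.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch: reduce a URL to its host so that two URLs on the same domain compare equal.
class DomainSketch {

    static String host(String value) {
        try {
            return new URL(value).getHost();
        } catch (MalformedURLException e) {
            // Same fallback as the comparator above: an unparsable value yields "".
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(host("http://www.example.org/some/path"));
        System.out.println(host("not a url").isEmpty());
    }
}
```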
|
|
@ -0,0 +1,44 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("exactMatch")
|
||||
public class ExactMatch extends AbstractStringComparator {
|
||||
|
||||
public ExactMatch(Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.JaroWinkler());
|
||||
}
|
||||
|
||||
public ExactMatch(final double weight) {
|
||||
super(weight, new com.wcohen.ss.JaroWinkler());
|
||||
}
|
||||
|
||||
protected ExactMatch(final double weight, final AbstractStringDistance ssalgo) {
|
||||
super(weight, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double distance(final String a, final String b, final Config conf) {
|
||||
if (a.isEmpty() || b.isEmpty()) {
|
||||
return -1.0; // return -1 if a field is missing
|
||||
}
|
||||
return a.equals(b) ? 1.0 : 0;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(final double d) {
|
||||
return d;
|
||||
}
|
||||
}
|
|
@ -0,0 +1,29 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("exactMatchIgnoreCase")
|
||||
public class ExactMatchIgnoreCase extends AbstractStringComparator {
|
||||
|
||||
public ExactMatchIgnoreCase(Map<String, String> params) {
|
||||
super(params);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double compare(String a, String b, final Config conf) {
|
||||
|
||||
if (a.isEmpty() || b.isEmpty())
|
||||
return -1;
|
||||
|
||||
return a.equalsIgnoreCase(b) ? 1 : 0;
|
||||
}
|
||||
|
||||
protected String toString(final Object object) {
|
||||
return toFirstString(object);
|
||||
}
|
||||
}
|
|
@ -0,0 +1,80 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.HashMap;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractListComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("instanceTypeMatch")
|
||||
public class InstanceTypeMatch extends AbstractListComparator {
|
||||
|
||||
final Map<String, String> translationMap = new HashMap<>();
|
||||
|
||||
public InstanceTypeMatch(Map<String, String> params) {
|
||||
super(params);
|
||||
|
||||
// jolly types
|
||||
translationMap.put("Conference object", "*");
|
||||
translationMap.put("Other literature type", "*");
|
||||
translationMap.put("Unknown", "*");
|
||||
translationMap.put("UNKNOWN", "*");
|
||||
|
||||
// article types
|
||||
translationMap.put("Article", "Article");
|
||||
translationMap.put("Data Paper", "Article");
|
||||
translationMap.put("Software Paper", "Article");
|
||||
translationMap.put("Preprint", "Article");
|
||||
|
||||
// thesis types
|
||||
translationMap.put("Thesis", "Thesis");
|
||||
translationMap.put("Master thesis", "Thesis");
|
||||
translationMap.put("Bachelor thesis", "Thesis");
|
||||
translationMap.put("Doctoral thesis", "Thesis");
|
||||
}
|
||||
|
||||
@Override
|
||||
public double compare(final List<String> a, final List<String> b, final Config conf) {
|
||||
|
||||
if (a == null || b == null) {
|
||||
return -1;
|
||||
}
|
||||
|
||||
if (a.isEmpty() || b.isEmpty()) {
|
||||
return -1;
|
||||
}
|
||||
|
||||
final Set<String> ca = a.stream().map(this::translate).collect(Collectors.toSet());
|
||||
final Set<String> cb = b.stream().map(this::translate).collect(Collectors.toSet());
|
||||
|
||||
// if at least one is a jolly type, it must produce a match
|
||||
if (ca.contains("*") || cb.contains("*"))
|
||||
return 1.0;
|
||||
|
||||
int incommon = Sets.intersection(ca, cb).size();
|
||||
|
||||
// if at least one is in common, it must produce a match
|
||||
return incommon >= 1 ? 1 : 0;
|
||||
}
|
||||
|
||||
public String translate(String term) {
|
||||
return translationMap.getOrDefault(term, term);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(final double d) {
|
||||
return d;
|
||||
}
|
||||
}
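The matching rule above can be exercised with a reduced translation map. This is a sketch under stated assumptions: the class name is hypothetical, only a few of the mappings are copied over, and the Guava `Sets.intersection` call is replaced with a plain stream check so that the example has no dependencies.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, dependency-free sketch of the instance type matching rule.
class InstanceTypeSketch {

    static final Map<String, String> TRANSLATION = new HashMap<>();
    static {
        TRANSLATION.put("Unknown", "*"); // wildcard ("jolly") type
        TRANSLATION.put("Article", "Article");
        TRANSLATION.put("Preprint", "Article");
        TRANSLATION.put("Thesis", "Thesis");
        TRANSLATION.put("Doctoral thesis", "Thesis");
    }

    static double match(List<String> a, List<String> b) {
        if (a == null || b == null || a.isEmpty() || b.isEmpty())
            return -1;

        Set<String> ca = a.stream().map(t -> TRANSLATION.getOrDefault(t, t)).collect(Collectors.toSet());
        Set<String> cb = b.stream().map(t -> TRANSLATION.getOrDefault(t, t)).collect(Collectors.toSet());

        // a wildcard on either side always matches
        if (ca.contains("*") || cb.contains("*"))
            return 1.0;

        // otherwise one shared translated type is enough
        return ca.stream().anyMatch(cb::contains) ? 1 : 0;
    }
}
```

Terms missing from the map are compared as they are, so two records with the same unmapped type still match.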
|
|
@ -0,0 +1,46 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
//case class JaroWinkler(w: Double) extends SecondStringDistanceAlgo(w, new com.wcohen.ss.JaroWinkler())
|
||||
@ComparatorClass("jaroWinkler")
|
||||
public class JaroWinkler extends AbstractStringComparator {
|
||||
|
||||
public JaroWinkler(Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.JaroWinkler());
|
||||
}
|
||||
|
||||
public JaroWinkler(double weight) {
|
||||
super(weight, new com.wcohen.ss.JaroWinkler());
|
||||
}
|
||||
|
||||
protected JaroWinkler(double weight, AbstractStringDistance ssalgo) {
|
||||
super(weight, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double distance(String a, String b, final Config conf) {
|
||||
String ca = cleanup(a);
|
||||
String cb = cleanup(b);
|
||||
|
||||
return normalize(ssalgo.score(ca, cb));
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(double d) {
|
||||
return d;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,74 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("jaroWinklerNormalizedName")
|
||||
public class JaroWinklerNormalizedName extends AbstractStringComparator {
|
||||
|
||||
private Map<String, String> params;
|
||||
|
||||
public JaroWinklerNormalizedName(Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.JaroWinkler());
|
||||
this.params = params;
|
||||
}
|
||||
|
||||
public JaroWinklerNormalizedName(double weight) {
|
||||
super(weight, new com.wcohen.ss.JaroWinkler());
|
||||
}
|
||||
|
||||
protected JaroWinklerNormalizedName(double weight, AbstractStringDistance ssalgo) {
|
||||
super(weight, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double distance(String a, String b, final Config conf) {
|
||||
String ca = cleanup(a);
|
||||
String cb = cleanup(b);
|
||||
|
||||
ca = normalize(ca);
|
||||
cb = normalize(cb);
|
||||
|
||||
ca = filterAllStopWords(ca);
|
||||
cb = filterAllStopWords(cb);
|
||||
|
||||
Set<String> keywords1 = getKeywords(
|
||||
ca, conf.translationMap(), Integer.parseInt(params.getOrDefault("windowSize", "4")));
|
||||
Set<String> keywords2 = getKeywords(
|
||||
cb, conf.translationMap(), Integer.parseInt(params.getOrDefault("windowSize", "4")));
|
||||
|
||||
Set<String> cities1 = getCities(ca, Integer.parseInt(params.getOrDefault("windowSize", "4")));
|
||||
Set<String> cities2 = getCities(cb, Integer.parseInt(params.getOrDefault("windowSize", "4")));
|
||||
|
||||
ca = removeKeywords(ca, keywords1);
|
||||
ca = removeKeywords(ca, cities1);
|
||||
cb = removeKeywords(cb, keywords2);
|
||||
cb = removeKeywords(cb, cities2);
|
||||
|
||||
ca = ca.replaceAll("[ ]{2,}", " ");
|
||||
cb = cb.replaceAll("[ ]{2,}", " ");
|
||||
|
||||
if (ca.isEmpty() && cb.isEmpty())
|
||||
return 1.0;
|
||||
else
|
||||
return normalize(ssalgo.score(ca, cb));
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(double d) {
|
||||
return d;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,47 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
//case class JaroWinkler(w: Double) extends SecondStringDistanceAlgo(w, new com.wcohen.ss.JaroWinkler())
|
||||
@ComparatorClass("jaroWinklerTitle")
|
||||
public class JaroWinklerTitle extends AbstractStringComparator {
|
||||
|
||||
public JaroWinklerTitle(Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.JaroWinkler());
|
||||
}
|
||||
|
||||
public JaroWinklerTitle(double weight) {
|
||||
super(weight, new com.wcohen.ss.JaroWinkler());
|
||||
}
|
||||
|
||||
protected JaroWinklerTitle(double weight, AbstractStringDistance ssalgo) {
|
||||
super(weight, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double distance(String a, String b, final Config conf) {
|
||||
String ca = cleanup(a);
|
||||
String cb = cleanup(b);
|
||||
|
||||
boolean check = checkNumbers(ca, cb);
|
||||
return check ? 0.5 : normalize(ssalgo.score(ca, cb));
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(double d) {
|
||||
return d;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,82 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import org.apache.commons.logging.Log;
|
||||
import org.apache.commons.logging.LogFactory;
|
||||
|
||||
import com.google.common.collect.Sets;
|
||||
import com.jayway.jsonpath.Configuration;
|
||||
import com.jayway.jsonpath.DocumentContext;
|
||||
import com.jayway.jsonpath.JsonPath;
|
||||
import com.jayway.jsonpath.Option;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractListComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
import eu.dnetlib.pace.util.MapDocumentUtil;
|
||||
|
||||
@ComparatorClass("jsonListMatch")
|
||||
public class JsonListMatch extends AbstractListComparator {
|
||||
|
||||
private static final Log log = LogFactory.getLog(JsonListMatch.class);
|
||||
private Map<String, String> params;
|
||||
|
||||
private String MODE; // "percentage" or "count"
|
||||
|
||||
public JsonListMatch(final Map<String, String> params) {
|
||||
super(params);
|
||||
this.params = params;
|
||||
|
||||
MODE = params.getOrDefault("mode", "percentage");
|
||||
}
|
||||
|
||||
@Override
|
||||
public double compare(final List<String> sa, final List<String> sb, final Config conf) {
|
||||
if (sa.isEmpty() || sb.isEmpty()) {
|
||||
return -1;
|
||||
}
|
||||
|
||||
final Set<String> ca = sa.stream().map(this::toComparableString).collect(Collectors.toSet());
|
||||
final Set<String> cb = sb.stream().map(this::toComparableString).collect(Collectors.toSet());
|
||||
|
||||
int incommon = Sets.intersection(ca, cb).size();
|
||||
int simDiff = Sets.symmetricDifference(ca, cb).size();
|
||||
|
||||
if (incommon + simDiff == 0) {
|
||||
return 0.0;
|
||||
}
|
||||
|
||||
if (MODE.equals("percentage"))
|
||||
return (double) incommon / (incommon + simDiff);
|
||||
else
|
||||
return incommon;
|
||||
|
||||
}
|
||||
|
||||
// converts each JSON string into a comparable string, based on the jpath parameters
|
||||
private String toComparableString(String json) {
|
||||
|
||||
StringBuilder st = new StringBuilder(); // builds the string used for comparisons, based on the jpath entries in
|
||||
// the parameters
|
||||
final DocumentContext documentContext = JsonPath
|
||||
.using(Configuration.defaultConfiguration().addOptions(Option.SUPPRESS_EXCEPTIONS))
|
||||
.parse(json);
|
||||
// for each path in the param list
|
||||
for (String key : params.keySet().stream().filter(k -> k.contains("jpath")).collect(Collectors.toList())) {
|
||||
String path = params.get(key);
|
||||
String value = MapDocumentUtil.getJPathString(path, documentContext);
|
||||
if (value == null || value.isEmpty())
|
||||
value = "";
|
||||
st.append(value);
|
||||
st.append("::");
|
||||
}
|
||||
|
||||
if (st.length() >= 2) st.setLength(st.length() - 2); // drop the trailing "::", if any path was appended
|
||||
return st.toString();
|
||||
}
|
||||
}
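In "percentage" mode the score above is the size of the intersection divided by the size of the intersection plus the symmetric difference, which is the size of the union: this is the Jaccard index. A minimal sketch with plain `java.util` sets follows; the class name is hypothetical and Guava's `Sets` helpers are replaced by `retainAll`/`addAll`.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, dependency-free sketch of the "percentage" and "count" scoring modes.
class JaccardSketch {

    static double score(Set<String> a, Set<String> b, String mode) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b); // intersection

        Set<String> union = new HashSet<>(a);
        union.addAll(b);

        int incommon = common.size();
        int simDiff = union.size() - incommon; // size of the symmetric difference

        if (incommon + simDiff == 0)
            return 0.0;

        return "percentage".equals(mode)
            ? (double) incommon / (incommon + simDiff)
            : incommon;
    }
}
```

For `{x, y, z}` and `{y, z, w}` the intersection has 2 elements and the union has 4, so "percentage" mode gives 0.5 and "count" mode gives 2.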
|
|
@ -0,0 +1,50 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("keywordMatch")
|
||||
public class KeywordMatch extends AbstractStringComparator {
|
||||
|
||||
Map<String, String> params;
|
||||
|
||||
public KeywordMatch(Map<String, String> params) {
|
||||
super(params);
|
||||
this.params = params;
|
||||
}
|
||||
|
||||
@Override
|
||||
public double distance(final String a, final String b, final Config conf) {
|
||||
|
||||
String ca = cleanup(a);
|
||||
String cb = cleanup(b);
|
||||
|
||||
ca = normalize(ca);
|
||||
cb = normalize(cb);
|
||||
|
||||
ca = filterAllStopWords(ca);
|
||||
cb = filterAllStopWords(cb);
|
||||
|
||||
Set<String> keywords1 = getKeywords(
|
||||
ca, conf.translationMap(), Integer.parseInt(params.getOrDefault("windowSize", "4")));
|
||||
Set<String> keywords2 = getKeywords(
|
||||
cb, conf.translationMap(), Integer.parseInt(params.getOrDefault("windowSize", "4")));
|
||||
|
||||
Set<String> codes1 = toCodes(keywords1, conf.translationMap());
|
||||
Set<String> codes2 = toCodes(keywords2, conf.translationMap());
|
||||
|
||||
// if no keywords are detected, the comparator gives 1.0
|
||||
if (codes1.isEmpty() && codes2.isEmpty())
|
||||
return 1.0;
|
||||
else {
|
||||
if (codes1.isEmpty() ^ codes2.isEmpty())
|
||||
return -1.0; // undefined if one of the two has no keywords
|
||||
return commonElementsPercentage(codes1, codes2);
|
||||
}
|
||||
}
|
||||
}
|
|
@ -0,0 +1,36 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("level2JaroWinkler")
|
||||
public class Level2JaroWinkler extends AbstractStringComparator {
|
||||
|
||||
public Level2JaroWinkler(Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.Level2JaroWinkler());
|
||||
}
|
||||
|
||||
public Level2JaroWinkler(double w) {
|
||||
super(w, new com.wcohen.ss.Level2JaroWinkler());
|
||||
}
|
||||
|
||||
protected Level2JaroWinkler(double w, AbstractStringDistance ssalgo) {
|
||||
super(w, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(double d) {
|
||||
return d;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,50 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("level2JaroWinklerTitle")
|
||||
public class Level2JaroWinklerTitle extends AbstractStringComparator {
|
||||
|
||||
public Level2JaroWinklerTitle(Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.Level2JaroWinkler());
|
||||
}
|
||||
|
||||
public Level2JaroWinklerTitle(final double w) {
|
||||
super(w, new com.wcohen.ss.Level2JaroWinkler());
|
||||
}
|
||||
|
||||
protected Level2JaroWinklerTitle(final double w, final AbstractStringDistance ssalgo) {
|
||||
super(w, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double distance(final String a, final String b, final Config conf) {
|
||||
final String ca = cleanup(a);
|
||||
final String cb = cleanup(b);
|
||||
|
||||
final boolean check = checkNumbers(ca, cb);
|
||||
|
||||
if (check)
|
||||
return 0.5;
|
||||
|
||||
return ssalgo.score(ca, cb);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(final double d) {
|
||||
return d;
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,36 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("level2Levenstein")
|
||||
public class Level2Levenstein extends AbstractStringComparator {
|
||||
|
||||
public Level2Levenstein(Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.Level2Levenstein());
|
||||
}
|
||||
|
||||
public Level2Levenstein(double w) {
|
||||
super(w, new com.wcohen.ss.Level2Levenstein());
|
||||
}
|
||||
|
||||
protected Level2Levenstein(double w, AbstractStringDistance ssalgo) {
|
||||
super(w, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(double d) {
|
||||
return 1 / Math.pow(Math.abs(d) + 1, 0.1);
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,36 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("levenstein")
|
||||
public class Levenstein extends AbstractStringComparator {
|
||||
|
||||
public Levenstein(Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.Levenstein());
|
||||
}
|
||||
|
||||
public Levenstein(double w) {
|
||||
super(w, new com.wcohen.ss.Levenstein());
|
||||
}
|
||||
|
||||
protected Levenstein(double w, AbstractStringDistance ssalgo) {
|
||||
super(w, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(double d) {
|
||||
return 1 / Math.pow(Math.abs(d) + 1, 0.1);
|
||||
}
|
||||
|
||||
}
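The `normalize` override above maps a raw edit-distance score `d` to `1 / (|d| + 1)^0.1`. A short sketch makes the shape of that curve concrete; the class name is hypothetical. The absolute value makes the result independent of the sign of the underlying score, and the small exponent makes the curve very flat: a distance of 0 gives exactly 1, a distance of 1 gives about 0.93, and a distance of 10 still gives about 0.79.

```java
// Hypothetical sketch of the score normalization used by the Levenstein comparators.
class LevenshteinNormalizeSketch {

    // 1 for a zero distance, decreasing slowly as |d| grows; never reaches 0.
    static double normalize(double d) {
        return 1 / Math.pow(Math.abs(d) + 1, 0.1);
    }

    public static void main(String[] args) {
        for (double d : new double[] { 0, 1, 2, 5, 10, 100 }) {
            System.out.println(d + " -> " + normalize(d));
        }
    }
}
```

Because the values stay high even for large distances, any threshold applied to this score has to sit close to 1 to be selective.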
|
|
@ -0,0 +1,69 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
import org.apache.commons.logging.Log;
|
||||
import org.apache.commons.logging.LogFactory;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("levensteinTitle")
|
||||
public class LevensteinTitle extends AbstractStringComparator {
|
||||
|
||||
private static final Log log = LogFactory.getLog(LevensteinTitle.class);
|
||||
|
||||
public LevensteinTitle(Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.Levenstein());
|
||||
}
|
||||
|
||||
public LevensteinTitle(final double w) {
|
||||
super(w, new com.wcohen.ss.Levenstein());
|
||||
}
|
||||
|
||||
protected LevensteinTitle(final double w, final AbstractStringDistance ssalgo) {
|
||||
super(w, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double distance(final String ca, final String cb, final Config conf) {
|
||||
final boolean check = checkNumbers(ca, cb);
|
||||
|
||||
if (check)
|
||||
return 0.5;
|
||||
|
||||
Double threshold = getDoubleParam("threshold");
|
||||
|
||||
// reduce Levenshtein algo complexity when target threshold is known
|
||||
if (threshold != null && threshold >= 0.0 && threshold <= 1.0) {
|
||||
int maxdistance = (int) Math.floor((1 - threshold) * Math.max(ca.length(), cb.length()));
|
||||
int score = StringUtils.getLevenshteinDistance(ca, cb, maxdistance);
|
||||
if (score == -1) {
|
||||
return 0;
|
||||
}
|
||||
return normalize(score, ca.length(), cb.length());
|
||||
} else {
|
||||
return normalize(StringUtils.getLevenshteinDistance(ca, cb), ca.length(), cb.length());
|
||||
}
|
||||
}
|
||||
|
||||
private double normalize(final double score, final int la, final int lb) {
|
||||
return 1 - (Math.abs(score) / Math.max(la, lb));
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(final double d) {
|
||||
return 1 / Math.pow(Math.abs(d) + 1, 0.1);
|
||||
}
|
||||
|
||||
}
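The threshold branch above turns a similarity threshold into a maximum edit distance: a similarity of at least `threshold` means at most `floor((1 - threshold) * maxLen)` edits, so anything beyond that bound can be scored 0 without further work. The sketch below reproduces that arithmetic with a hand-written Levenshtein distance, so that it has no dependencies. The class and method names are hypothetical, and unlike the bounded `StringUtils.getLevenshteinDistance` call it does not stop early; it only shows the cut-off logic. The guard for two empty strings is an assumption, added to avoid a 0/0 division.

```java
// Hypothetical, dependency-free sketch of the threshold-bounded title similarity.
class BoundedLevenshteinSketch {

    // Classic two-row dynamic programming edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++)
            prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    static double similarity(String a, String b, double threshold) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0)
            return 1.0; // assumption: two empty strings are treated as identical

        int maxDistance = (int) Math.floor((1 - threshold) * maxLen);
        int distance = levenshtein(a, b);
        if (distance > maxDistance)
            return 0; // cannot reach the threshold

        return 1 - ((double) distance / maxLen);
    }
}
```

For "kitten" and "sitting" the distance is 3 and the longer string has 7 characters. With a threshold of 0.5 the bound is floor(3.5) = 3, so the pair is scored 1 - 3/7; with a threshold of 0.75 the bound is floor(1.75) = 1 and the pair is scored 0.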
|
|
@ -0,0 +1,58 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
/**
|
||||
* Compares two titles, ignoring version numbers. Suitable for Software entities.
|
||||
*/
|
||||
@ComparatorClass("levensteinTitleIgnoreVersion")
|
||||
public class LevensteinTitleIgnoreVersion extends AbstractStringComparator {
|
||||
|
||||
public LevensteinTitleIgnoreVersion(Map<String, String> params) {
|
||||
super(params, new com.wcohen.ss.Levenstein());
|
||||
}
|
||||
|
||||
public LevensteinTitleIgnoreVersion(final double w) {
|
||||
super(w, new com.wcohen.ss.Levenstein());
|
||||
}
|
||||
|
||||
protected LevensteinTitleIgnoreVersion(final double w, final AbstractStringDistance ssalgo) {
|
||||
super(w, ssalgo);
|
||||
}
|
||||
|
||||
@Override
|
||||
public double distance(final String a, final String b, final Config conf) {
|
||||
String ca = cleanup(a);
|
||||
String cb = cleanup(b);
|
||||
|
||||
ca = ca.replaceAll("\\d", "").replaceAll(getRomans(ca), "").trim();
|
||||
cb = cb.replaceAll("\\d", "").replaceAll(getRomans(cb), "").trim();
|
||||
|
||||
ca = filterAllStopWords(ca);
|
||||
cb = filterAllStopWords(cb);
|
||||
|
||||
return normalize(ssalgo.score(ca, cb), ca.length(), cb.length());
|
||||
}
|
||||
|
||||
private double normalize(final double score, final int la, final int lb) {
|
||||
return 1 - (Math.abs(score) / Math.max(la, lb));
|
||||
}
|
||||
|
||||
@Override
|
||||
public double getWeight() {
|
||||
return super.weight;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected double normalize(final double d) {
|
||||
return 1 / Math.pow(Math.abs(d) + 1, 0.1);
|
||||
}
|
||||
|
||||
}
|
|
@ -0,0 +1,66 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractListComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
/**
|
||||
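* The Class ListContainsMatch: checks whether a configured string is contained in the two lists, combining the two results with AND, OR or XOR.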
|
||||
*
|
||||
* @author miconis
|
||||
* */
|
||||
@ComparatorClass("listContainsMatch")
|
||||
public class ListContainsMatch extends AbstractListComparator {
|
||||
|
||||
private Map<String, String> params;
|
||||
private boolean CASE_SENSITIVE;
|
||||
private String STRING;
|
||||
private String AGGREGATOR;
|
||||
|
||||
public ListContainsMatch(Map<String, String> params) {
|
||||
super(params);
|
||||
this.params = params;
|
||||
|
||||
// read parameters
|
||||
CASE_SENSITIVE = Boolean.parseBoolean(params.getOrDefault("caseSensitive", "false"));
|
||||
STRING = params.get("string");
|
||||
AGGREGATOR = params.get("bool");
|
||||
}
|
||||
|
||||
@Override
|
||||
public double compare(List<String> sa, List<String> sb, Config conf) {
|
||||
if (sa.isEmpty() || sb.isEmpty()) {
|
||||
return -1;
|
||||
}
|
||||
|
||||
if (!CASE_SENSITIVE) {
|
||||
sa = sa.stream().map(String::toLowerCase).collect(Collectors.toList());
|
||||
sb = sb.stream().map(String::toLowerCase).collect(Collectors.toList());
|
||||
STRING = STRING.toLowerCase();
|
||||
}
|
||||
|
||||
switch (AGGREGATOR) {
|
||||
case "AND":
|
||||
if (sa.contains(STRING) && sb.contains(STRING))
|
||||
return 1.0;
|
||||
break;
|
||||
case "OR":
|
||||
if (sa.contains(STRING) || sb.contains(STRING))
|
||||
return 1.0;
|
||||
break;
|
||||
case "XOR":
|
||||
if (sa.contains(STRING) ^ sb.contains(STRING))
|
||||
return 1.0;
|
||||
break;
|
||||
default:
|
||||
return 0.0;
|
||||
}
|
||||
return 0.0;
|
||||
|
||||
}
|
||||
}
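The aggregation above reduces to two membership tests combined by a boolean operator. A minimal sketch follows; the class, method, and parameter names are hypothetical. It differs from the comparator in two labeled ways: it lower-cases local copies instead of reassigning a field on every call, and it treats a missing `string` or `bool` parameter as "no match" instead of letting the `switch` fail on a null value.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical sketch of the AND / OR / XOR list containment check.
class ListContainsSketch {

    static double match(List<String> sa, List<String> sb, String needle, String aggregator,
        boolean caseSensitive) {
        if (sa.isEmpty() || sb.isEmpty())
            return -1;

        // Assumption: missing parameters mean "no match" rather than an exception.
        if (needle == null || aggregator == null)
            return 0.0;

        if (!caseSensitive) {
            sa = sa.stream().map(s -> s.toLowerCase(Locale.ROOT)).collect(Collectors.toList());
            sb = sb.stream().map(s -> s.toLowerCase(Locale.ROOT)).collect(Collectors.toList());
            needle = needle.toLowerCase(Locale.ROOT);
        }

        boolean inA = sa.contains(needle);
        boolean inB = sb.contains(needle);

        switch (aggregator) {
            case "AND":
                return inA && inB ? 1.0 : 0.0;
            case "OR":
                return inA || inB ? 1.0 : 0.0;
            case "XOR":
                return inA ^ inB ? 1.0 : 0.0;
            default:
                return 0.0;
        }
    }
}
```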
|
|
@ -0,0 +1,29 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.Map;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
@ComparatorClass("maxLengthMatch")
|
||||
public class MaxLengthMatch extends AbstractStringComparator {
|
||||
|
||||
private final int limit;
|
||||
|
||||
public MaxLengthMatch(Map<String, String> params) {
|
||||
super(params);
|
||||
|
||||
limit = Integer.parseInt(params.getOrDefault("limit", "200"));
|
||||
}
|
||||
|
||||
@Override
|
||||
public double compare(String a, String b, final Config conf) {
|
||||
return a.length() < limit && b.length() < limit ? 1.0 : -1.0;
|
||||
}
|
||||
|
||||
protected String toString(final Object object) {
|
||||
return toFirstString(object);
|
||||
}
|
||||
}
|
Some files were not shown because too many files have changed in this diff