forked from D-Net/dnet-hadoop
Compare commits: master...dedup-with (20 commits)
Author | SHA1 | Date |
---|---|---|
Sandro La Bruzzo | edf5a780b8 | |
Sandro La Bruzzo | 74fcea66e6 | |
Sandro La Bruzzo | e4feedd67e | |
Giambattista Bloisi | 617ef05e15 | |
Giambattista Bloisi | b6a8be813b | |
Sandro La Bruzzo | f1ae28fe42 | |
Sandro La Bruzzo | acf947442a | |
Giambattista Bloisi | d80f12da06 | |
Giambattista Bloisi | 861c368e65 | |
Giambattista Bloisi | 745e70e0d7 | |
Giambattista Bloisi | dcc08cc512 | |
Giambattista Bloisi | df19548c56 | |
Sandro La Bruzzo | 890b49fb5d | |
Giambattista Bloisi | 3129c1c48b | |
Giambattista Bloisi | cb7ad9889c | |
Claudio Atzori | 75ff902f9d | |
Claudio Atzori | 326367eccc | |
Claudio Atzori | 521dd7f167 | |
Claudio Atzori | 649679de8d | |
Sandro La Bruzzo | 4c2dfcbdf7 | |
@ -26,4 +26,3 @@ spark-warehouse
|
|||
/**/*.log
|
||||
/**/.factorypath
|
||||
/**/.scalafmt.conf
|
||||
/.java-version
|
||||
|
|
128
README.md
|
@ -1,128 +1,2 @@
|
|||
# dnet-hadoop

Dnet-hadoop is the project that defined all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing, provisioning.

How to build, package and run oozie workflows
====================

Oozie-installer is a utility allowing building, uploading and running oozie workflows. In practice, it creates a `*.tar.gz`
package that contains resources that define a workflow and some helper scripts.

This module is automatically executed when running:

`mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app`

on module having set:

```
<parent>
    <groupId>eu.dnetlib.dhp</groupId>
    <artifactId>dhp-workflows</artifactId>
</parent>
```

in `pom.xml` file. `oozie-package` profile initializes oozie workflow packaging, `workflow.source.dir` property points to
a workflow (notice: this is not a relative path but a classpath to directory usually holding `oozie_app` subdirectory).

The outcome of this packaging is `oozie-package.tar.gz` file containing inside all the resources required to run Oozie workflow:

- jar packages
- workflow definitions
- job properties
- maintenance scripts
|
||||
|
||||
Required properties
====================

In order to include the proper workflow within the package, the `workflow.source.dir` property has to be set. It can be provided
by setting the `-Dworkflow.source.dir=some/job/dir` maven parameter.

In order to define the full set of cluster environment properties, one should create a `~/.dhp/application.properties` file with
the following properties:

- `dhp.hadoop.frontend.user.name` - your user name on hadoop cluster and frontend machine
- `dhp.hadoop.frontend.host.name` - frontend host name
- `dhp.hadoop.frontend.temp.dir` - frontend directory for temporary files
- `dhp.hadoop.frontend.port.ssh` - frontend machine ssh port
- `oozieServiceLoc` - oozie service location required by run_workflow.sh script executing oozie job
- `nameNode` - name node address
- `jobTracker` - job tracker address
- `oozie.execution.log.file.location` - location of file that will be created when executing oozie job, it contains output
produced by `run_workflow.sh` script (needed to obtain oozie job id)
- `maven.executable` - mvn command location, requires parameterization due to a different setup of CI cluster
- `sparkDriverMemory` - amount of memory assigned to spark jobs driver
- `sparkExecutorMemory` - amount of memory assigned to spark jobs executors
- `sparkExecutorCores` - number of cores assigned to spark jobs executors

All values will be overridden with the ones from `job.properties` and, if present, `job-override.properties` stored in the module's
main folder.

When overriding properties from `job.properties`, a `job-override.properties` file can be created in the main module directory
(the one containing the `pom.xml` file) to define all new properties, which will override the existing ones.
One can also provide those properties one by one as command line `-D` arguments.

Properties overriding order is the following:

1. `pom.xml` defined properties (located in the project root dir)
2. `~/.dhp/application.properties` defined properties
3. `${workflow.source.dir}/job.properties`
4. `job-override.properties` (located in the project root dir)
5. `maven -Dparam=value`
|
||||
|
||||
where the maven `-Dparam` property is overriding all the other ones.
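
The override order above can be pictured as layering property sets from lowest to highest priority. The following is a minimal Java sketch of that idea only: the class `PropertyLayers` and its methods are hypothetical names introduced for illustration and are not part of the oozie-installer, and the memory values are made up.

```java
import java.util.Properties;

// Hypothetical illustration of the override order; not the actual oozie-installer code.
class PropertyLayers {

    // Merges property sources in ascending priority: later sources override earlier ones.
    static Properties merge(Properties... layersLowestFirst) {
        Properties merged = new Properties();
        for (Properties layer : layersLowestFirst) {
            merged.putAll(layer);
        }
        return merged;
    }

    // Convenience factory for a single-entry property set.
    static Properties of(String key, String value) {
        Properties p = new Properties();
        p.setProperty(key, value);
        return p;
    }

    public static void main(String[] args) {
        Properties merged = merge(
            of("sparkDriverMemory", "1G"),  // 1. pom.xml
            of("sparkDriverMemory", "2G"),  // 2. ~/.dhp/application.properties
            of("sparkDriverMemory", "4G"),  // 3. ${workflow.source.dir}/job.properties
            of("sparkDriverMemory", "6G"),  // 4. job-override.properties
            of("sparkDriverMemory", "8G")); // 5. maven -Dparam=value
        System.out.println(merged.getProperty("sparkDriverMemory")); // prints 8G
    }
}
```

A key that appears only in a lower layer survives the merge, which matches the description of `job-override.properties` adding or replacing individual entries rather than replacing the whole file.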
|
||||
|
||||
Workflow definition requirements
====================

`workflow.source.dir` property should point to the following directory structure:

    [${workflow.source.dir}]
      |
      |-job.properties (optional)
      |
      \-[oozie_app]
        |
        \-workflow.xml

This property can be set using maven `-D` switch.

`[oozie_app]` is the default directory name; however, it can be set to any value as long as the `oozieAppDir` property is
provided with the directory name as its value.
|
||||
|
||||
Sub-workflows are supported as well and sub-workflow directories should be nested within `[oozie_app]` directory.
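
As a rough aid, the layout described above can be checked before packaging. This is a minimal sketch under stated assumptions: `WorkflowLayoutCheck` is a hypothetical helper that is not part of the oozie-installer, and it treats `workflow.source.dir` as a plain filesystem path, whereas the README describes it as a classpath location.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the oozie-installer.
class WorkflowLayoutCheck {

    // True when <sourceDir>/<oozieAppDir>/workflow.xml exists as a regular file.
    // job.properties is optional, so it is deliberately not checked.
    static boolean hasWorkflowDefinition(Path sourceDir, String oozieAppDir) {
        return Files.isRegularFile(sourceDir.resolve(oozieAppDir).resolve("workflow.xml"));
    }
}
```

Passing `"oozie_app"` as the second argument corresponds to the default directory name; any other value plays the role of the `oozieAppDir` property.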
|
||||
|
||||
Creating oozie installer step-by-step
=====================================

Automated oozie-installer steps are the following:

1. creating jar packages: `*.jar` and `*tests.jar` along with copying all dependencies in `target/dependencies`
2. reading properties from maven, `~/.dhp/application.properties`, `job.properties`, `job-override.properties`
3. invoking priming mechanism linking resources from import.txt file (currently resolving subworkflow resources)
4. assembling shell scripts for preparing Hadoop filesystem, uploading Oozie application and starting workflow
5. copying whole `${workflow.source.dir}` content to `target/${oozie.package.file.name}`
6. generating updated `job.properties` file in `target/${oozie.package.file.name}` based on maven,
`~/.dhp/application.properties`, `job.properties` and `job-override.properties`
7. creating `lib` directory (or multiple directories for sub-workflows for each nested directory) and copying jar packages
created at step (1) to each one of them
8. bundling whole `${oozie.package.file.name}` directory into single tar.gz package
|
||||
|
||||
Uploading oozie package and running workflow on cluster
=======================================================

In order to simplify deployment and execution process two dedicated profiles were introduced:

- `deploy`
- `run`

to be used along with `oozie-package` profile e.g. by providing `-Poozie-package,deploy,run` maven parameters.

The `deploy` profile supplements packaging process with:
1) uploading oozie-package via scp to `/home/${user.name}/oozie-packages` directory on `${dhp.hadoop.frontend.host.name}` machine
2) extracting uploaded package
3) uploading oozie content to hadoop cluster HDFS location defined in `oozie.wf.application.path` property (generated dynamically by maven build process, based on `${dhp.hadoop.frontend.user.name}` and `workflow.source.dir` properties)

The `run` profile introduces:
1) executing oozie application uploaded to HDFS cluster using `deploy` command. Triggers `run_workflow.sh` script providing runtime properties defined in `job.properties` file.

Notice: ssh access to frontend machine has to be configured on system level and it is preferable to set key-based authentication in order to simplify remote operations.
|
||||
Dnet-hadoop is the project that defined all the OOZIE workflows for the OpenAIRE Graph construction, processing, provisioning.
|
|
@ -62,11 +62,6 @@
|
|||
</build>
|
||||
|
||||
<dependencies>
|
||||
<dependency>
|
||||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
<artifactId>dhp-pace-core</artifactId>
|
||||
<version>${project.version}</version>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>org.apache.hadoop</groupId>
|
||||
|
@ -129,6 +124,12 @@
|
|||
<dependency>
|
||||
<groupId>eu.dnetlib</groupId>
|
||||
<artifactId>cnr-rmi-api</artifactId>
|
||||
<exclusions>
|
||||
<exclusion>
|
||||
<groupId>log4j</groupId>
|
||||
<artifactId>log4j</artifactId>
|
||||
</exclusion>
|
||||
</exclusions>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
|
@ -149,6 +150,11 @@
|
|||
<artifactId>okhttp</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
<artifactId>dhp-pace-core</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>org.apache.httpcomponents</groupId>
|
||||
<artifactId>httpclient</artifactId>
|
||||
|
@ -161,7 +167,7 @@
|
|||
|
||||
<dependency>
|
||||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
<artifactId>${dhp-schemas.artifact}</artifactId>
|
||||
<artifactId>dhp-schemas_${scala.binary.version}</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
|
|
|
@ -11,18 +11,25 @@ import org.apache.commons.lang3.StringUtils;
|
|||
import org.apache.spark.SparkConf;
|
||||
import org.apache.spark.api.java.function.FilterFunction;
|
||||
import org.apache.spark.api.java.function.MapFunction;
|
||||
import org.apache.spark.sql.*;
|
||||
import org.apache.spark.sql.Encoders;
|
||||
import org.apache.spark.sql.SaveMode;
|
||||
import org.apache.spark.sql.SparkSession;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
|
||||
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
|
||||
import eu.dnetlib.dhp.common.HdfsSupport;
|
||||
import eu.dnetlib.dhp.schema.common.ModelSupport;
|
||||
import eu.dnetlib.dhp.schema.oaf.Oaf;
|
||||
import eu.dnetlib.dhp.schema.oaf.OafEntity;
|
||||
|
||||
public class DispatchEntitiesSparkJob {
|
||||
|
||||
private static final Logger log = LoggerFactory.getLogger(DispatchEntitiesSparkJob.class);
|
||||
|
||||
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
|
||||
|
||||
public static void main(String[] args) throws Exception {
|
||||
|
||||
String jsonConfiguration = IOUtils
|
||||
|
@ -47,51 +54,44 @@ public class DispatchEntitiesSparkJob {
|
|||
String outputPath = parser.get("outputPath");
|
||||
log.info("outputPath: {}", outputPath);
|
||||
|
||||
boolean filterInvisible = Boolean.parseBoolean(parser.get("filterInvisible"));
|
||||
log.info("filterInvisible: {}", filterInvisible);
|
||||
String graphTableClassName = parser.get("graphTableClassName");
|
||||
log.info("graphTableClassName: {}", graphTableClassName);
|
||||
|
||||
@SuppressWarnings("unchecked")
|
||||
Class<? extends OafEntity> entityClazz = (Class<? extends OafEntity>) Class.forName(graphTableClassName);
|
||||
|
||||
SparkConf conf = new SparkConf();
|
||||
runWithSparkSession(
|
||||
conf,
|
||||
isSparkSessionManaged,
|
||||
spark -> dispatchEntities(spark, inputPath, outputPath, filterInvisible));
|
||||
spark -> {
|
||||
HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration());
|
||||
dispatchEntities(spark, inputPath, entityClazz, outputPath);
|
||||
});
|
||||
}
|
||||
|
||||
private static void dispatchEntities(
|
||||
private static <T extends Oaf> void dispatchEntities(
|
||||
SparkSession spark,
|
||||
String inputPath,
|
||||
String outputPath,
|
||||
boolean filterInvisible) {
|
||||
Class<T> clazz,
|
||||
String outputPath) {
|
||||
|
||||
Dataset<String> df = spark.read().textFile(inputPath);
|
||||
|
||||
ModelSupport.oafTypes.entrySet().parallelStream().forEach(entry -> {
|
||||
String entityType = entry.getKey();
|
||||
Class<?> clazz = entry.getValue();
|
||||
|
||||
final String entityPath = outputPath + "/" + entityType;
|
||||
if (!entityType.equalsIgnoreCase("relation")) {
|
||||
HdfsSupport.remove(entityPath, spark.sparkContext().hadoopConfiguration());
|
||||
Dataset<Row> entityDF = spark
|
||||
.read()
|
||||
.schema(Encoders.bean(clazz).schema())
|
||||
.json(
|
||||
df
|
||||
.filter((FilterFunction<String>) s -> s.startsWith(clazz.getName()))
|
||||
.map(
|
||||
(MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"),
|
||||
Encoders.STRING()));
|
||||
|
||||
if (filterInvisible) {
|
||||
entityDF = entityDF.filter("dataInfo.invisible != true");
|
||||
}
|
||||
|
||||
entityDF
|
||||
.write()
|
||||
.mode(SaveMode.Overwrite)
|
||||
.option("compression", "gzip")
|
||||
.json(entityPath);
|
||||
}
|
||||
});
|
||||
spark
|
||||
.read()
|
||||
.textFile(inputPath)
|
||||
.filter((FilterFunction<String>) s -> isEntityType(s, clazz))
|
||||
.map((MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"), Encoders.STRING())
|
||||
.map(
|
||||
(MapFunction<String, T>) value -> OBJECT_MAPPER.readValue(value, clazz),
|
||||
Encoders.bean(clazz))
|
||||
.write()
|
||||
.mode(SaveMode.Overwrite)
|
||||
.option("compression", "gzip")
|
||||
.json(outputPath);
|
||||
}
|
||||
|
||||
private static <T extends Oaf> boolean isEntityType(final String s, final Class<T> clazz) {
|
||||
return StringUtils.substringBefore(s, "|").equals(clazz.getName());
|
||||
}
|
||||
|
||||
}
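
Review note on `DispatchEntitiesSparkJob`: the two sides of this hunk select input lines differently. One filters with `s.startsWith(clazz.getName())`, the other (`isEntityType`) compares the full text before the first `|`. The sketch below reproduces both checks with plain `String` methods so it runs without commons-lang3. `TypedLine` and the class names `a.Result` and `a.ResultExtra` are hypothetical and used only to show how a prefix test can accept a line belonging to a different type.

```java
// Hypothetical stand-alone sketch; the job itself uses StringUtils from commons-lang3.
class TypedLine {

    // Text before the first '|' (the whole string when there is no '|'),
    // matching the behaviour of StringUtils.substringBefore.
    static String typeOf(String line) {
        int i = line.indexOf('|');
        return i < 0 ? line : line.substring(0, i);
    }

    // Text after the first '|' (empty when there is no '|'),
    // matching the behaviour of StringUtils.substringAfter.
    static String payloadOf(String line) {
        int i = line.indexOf('|');
        return i < 0 ? "" : line.substring(i + 1);
    }

    // Exact comparison of the type prefix against the expected class name.
    static boolean isEntityType(String line, String className) {
        return typeOf(line).equals(className);
    }

    public static void main(String[] args) {
        String line = "a.ResultExtra|{\"id\":\"1\"}";
        System.out.println(line.startsWith("a.Result"));    // true: prefix test accepts it
        System.out.println(isEntityType(line, "a.Result")); // false: exact test rejects it
    }
}
```

Only the first `|` is treated as the separator, so a `|` inside the JSON payload is left untouched.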
|
||||
|
|
|
@ -6,16 +6,14 @@ import java.util.regex.Pattern;
|
|||
|
||||
public class FundRefCleaningRule {
|
||||
|
||||
public static final Pattern PATTERN = Pattern.compile("\\d+");
|
||||
public static String clean(final String fundrefId) {
|
||||
|
||||
public static String clean(final String fundRefId) {
|
||||
|
||||
String s = fundRefId
|
||||
String s = fundrefId
|
||||
.toLowerCase()
|
||||
.replaceAll("\\s", "");
|
||||
|
||||
Matcher m = PATTERN.matcher(s);
|
||||
if (m.find()) {
|
||||
Matcher m = Pattern.compile("\\d+").matcher(s);
|
||||
if (m.matches()) {
|
||||
return m.group();
|
||||
} else {
|
||||
return "";
|
||||
|
|
|
@ -13,7 +13,11 @@ import java.util.stream.Collectors;
|
|||
import java.util.stream.Stream;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
import org.apache.spark.api.java.function.MapFunction;
|
||||
import org.apache.spark.sql.Encoders;
|
||||
|
||||
import com.fasterxml.jackson.core.JsonProcessingException;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
import com.github.sisyphsu.dateparser.DateParserUtils;
|
||||
import com.google.common.collect.Lists;
|
||||
import com.google.common.collect.Sets;
|
||||
|
@ -35,7 +39,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
public static final String TITLE_FILTER_REGEX = String.format("(%s)|\\W|\\d", TITLE_TEST);
|
||||
|
||||
public static final int TITLE_FILTER_RESIDUAL_LENGTH = 5;
|
||||
private static final String NAME_CLEANING_REGEX = "[\\r\\n\\t\\s]+";
|
||||
|
||||
public static <T extends Oaf> T cleanContext(T value, String contextId, String verifyParam) {
|
||||
if (ModelSupport.isSubClass(value, Result.class)) {
|
||||
|
@ -225,7 +228,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
}
|
||||
|
||||
public static <T extends Oaf> boolean filter(T value) {
|
||||
if (!(value instanceof Relation) && (Boolean.TRUE
|
||||
if (Boolean.TRUE
|
||||
.equals(
|
||||
Optional
|
||||
.ofNullable(value)
|
||||
|
@ -236,16 +239,15 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
d -> Optional
|
||||
.ofNullable(d.getInvisible())
|
||||
.orElse(true))
|
||||
.orElse(false))
|
||||
.orElse(true)))) {
|
||||
.orElse(true))
|
||||
.orElse(true))) {
|
||||
return true;
|
||||
}
|
||||
|
||||
if (value instanceof Datasource) {
|
||||
// nothing to evaluate here
|
||||
} else if (value instanceof Project) {
|
||||
final Project p = (Project) value;
|
||||
return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
|
||||
// nothing to evaluate here
|
||||
} else if (value instanceof Organization) {
|
||||
// nothing to evaluate here
|
||||
} else if (value instanceof Relation) {
|
||||
|
@ -292,13 +294,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
} else if (value instanceof Result) {
|
||||
Result r = (Result) value;
|
||||
|
||||
if (Objects.nonNull(r.getFulltext())
|
||||
&& (ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
|
||||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
|
||||
r.setFulltext(null);
|
||||
|
||||
}
|
||||
|
||||
if (Objects.nonNull(r.getDateofacceptance())) {
|
||||
Optional<String> date = cleanDateField(r.getDateofacceptance());
|
||||
if (date.isPresent()) {
|
||||
|
@ -323,18 +318,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
|
||||
.collect(Collectors.toList()));
|
||||
}
|
||||
if (Objects.nonNull(r.getPublisher())) {
|
||||
if (StringUtils.isBlank(r.getPublisher().getValue())) {
|
||||
r.setPublisher(null);
|
||||
} else {
|
||||
r
|
||||
.getPublisher()
|
||||
.setValue(
|
||||
r
|
||||
.getPublisher()
|
||||
.getValue()
|
||||
.replaceAll(NAME_CLEANING_REGEX, " "));
|
||||
}
|
||||
if (Objects.nonNull(r.getPublisher()) && StringUtils.isBlank(r.getPublisher().getValue())) {
|
||||
r.setPublisher(null);
|
||||
}
|
||||
if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) {
|
||||
r
|
||||
|
@ -501,11 +486,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
i.setDateofacceptance(null);
|
||||
}
|
||||
}
|
||||
if (StringUtils.isNotBlank(i.getFulltext()) &&
|
||||
(ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
|
||||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
|
||||
i.setFulltext(null);
|
||||
}
|
||||
}
|
||||
}
|
||||
if (Objects.isNull(r.getBestaccessright())
|
||||
|
@ -530,7 +510,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
.filter(Objects::nonNull)
|
||||
.filter(a -> StringUtils.isNotBlank(a.getFullname()))
|
||||
.filter(a -> StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")))
|
||||
.map(GraphCleaningFunctions::cleanupAuthor)
|
||||
.collect(Collectors.toList()));
|
||||
|
||||
boolean nullRank = r
|
||||
|
@ -625,35 +604,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
|||
return value;
|
||||
}
|
||||
|
||||
private static Author cleanupAuthor(Author author) {
|
||||
if (StringUtils.isNotBlank(author.getFullname())) {
|
||||
author
|
||||
.setFullname(
|
||||
author
|
||||
.getFullname()
|
||||
.replaceAll(NAME_CLEANING_REGEX, " ")
|
||||
.replace("\"", "\\\""));
|
||||
}
|
||||
if (StringUtils.isNotBlank(author.getName())) {
|
||||
author
|
||||
.setName(
|
||||
author
|
||||
.getName()
|
||||
.replaceAll(NAME_CLEANING_REGEX, " ")
|
||||
.replace("\"", "\\\""));
|
||||
}
|
||||
if (StringUtils.isNotBlank(author.getSurname())) {
|
||||
author
|
||||
.setSurname(
|
||||
author
|
||||
.getSurname()
|
||||
.replaceAll(NAME_CLEANING_REGEX, " ")
|
||||
.replace("\"", "\\\""));
|
||||
}
|
||||
|
||||
return author;
|
||||
}
|
||||
|
||||
private static Optional<String> cleanDateField(Field<String> dateofacceptance) {
|
||||
return Optional
|
||||
.ofNullable(dateofacceptance)
|
||||
|
|
|
@ -6,19 +6,13 @@ import java.util.regex.Pattern;
|
|||
|
||||
public class GridCleaningRule {
|
||||
|
||||
public static final Pattern PATTERN = Pattern.compile("(?<grid>\\d{4,6}\\.[0-9a-z]{1,2})");
|
||||
|
||||
public static String clean(String grid) {
|
||||
String s = grid
|
||||
.replaceAll("\\s", "")
|
||||
.toLowerCase();
|
||||
|
||||
Matcher m = PATTERN.matcher(s);
|
||||
if (m.find()) {
|
||||
return "grid." + m.group("grid");
|
||||
}
|
||||
|
||||
return "";
|
||||
Matcher m = Pattern.compile("\\d{4,6}\\.[0-9a-z]{1,2}").matcher(s);
|
||||
return m.matches() ? "grid." + m.group() : "";
|
||||
}
|
||||
|
||||
}
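
Review note on the cleaning rules: several hunks in this comparison differ only in whether the `Matcher` is queried with `find()` or `matches()`. The sketch below puts both variants of the GRID rule side by side, using the regular expression from the diff without the named group. `FindVersusMatches` is a hypothetical class name, and the sample inputs are the ones from `GridCleaningRuleTest`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical side-by-side comparison; not a class from the repository.
class FindVersusMatches {

    private static final Pattern GRID = Pattern.compile("\\d{4,6}\\.[0-9a-z]{1,2}");

    // find(): the pattern may occur anywhere in the normalized input.
    static String cleanWithFind(String grid) {
        Matcher m = GRID.matcher(grid.replaceAll("\\s", "").toLowerCase());
        return m.find() ? "grid." + m.group() : "";
    }

    // matches(): the whole normalized input must match the pattern.
    static String cleanWithMatches(String grid) {
        Matcher m = GRID.matcher(grid.replaceAll("\\s", "").toLowerCase());
        return m.matches() ? "grid." + m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(cleanWithFind("grid.493784.5"));    // grid.493784.5
        System.out.println(cleanWithMatches("grid.493784.5")); // empty string: the "grid." prefix is not part of the pattern
        System.out.println(cleanWithMatches("493784.5x"));     // grid.493784.5x
    }
}
```

With `matches()`, an identifier that already carries the `grid.` prefix is rejected, which is consistent with `GridCleaningRuleTest` being removed on the side of the diff that uses `matches()`.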
|
||||
|
|
|
@ -7,12 +7,10 @@ import java.util.regex.Pattern;
|
|||
// https://www.wikidata.org/wiki/Property:P213
|
||||
public class ISNICleaningRule {
|
||||
|
||||
public static final Pattern PATTERN = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])");
|
||||
|
||||
public static String clean(final String isni) {
|
||||
|
||||
Matcher m = PATTERN.matcher(isni);
|
||||
if (m.find()) {
|
||||
Matcher m = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])").matcher(isni);
|
||||
if (m.matches()) {
|
||||
return String.join("", m.group(1), m.group(2), m.group(3), m.group(4));
|
||||
} else {
|
||||
return "";
|
||||
|
|
|
@ -6,12 +6,10 @@ import java.util.regex.Pattern;
|
|||
|
||||
public class PICCleaningRule {
|
||||
|
||||
public static final Pattern PATTERN = Pattern.compile("\\d{9}");
|
||||
|
||||
public static String clean(final String pic) {
|
||||
|
||||
Matcher m = PATTERN.matcher(pic);
|
||||
if (m.find()) {
|
||||
Matcher m = Pattern.compile("\\d{9}").matcher(pic);
|
||||
if (m.matches()) {
|
||||
return m.group();
|
||||
} else {
|
||||
return "";
|
||||
|
|
|
@ -1,24 +1,13 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import java.util.regex.Matcher;
|
||||
import java.util.regex.Pattern;
|
||||
|
||||
public class PmcCleaningRule {
|
||||
|
||||
public static final Pattern PATTERN = Pattern.compile("PMC\\d{1,8}");
|
||||
|
||||
public static String clean(String pmc) {
|
||||
String s = pmc
|
||||
.replaceAll("\\s", "")
|
||||
.toUpperCase();
|
||||
|
||||
final Matcher m = PATTERN.matcher(s);
|
||||
|
||||
if (m.find()) {
|
||||
return m.group();
|
||||
}
|
||||
return "";
|
||||
return s.matches("^PMC\\d{1,8}$") ? s : "";
|
||||
}
|
||||
|
||||
}
|
||||
|
|
|
@ -1,25 +1,16 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import java.util.regex.Matcher;
|
||||
import java.util.regex.Pattern;
|
||||
|
||||
// https://researchguides.stevens.edu/c.php?g=442331&p=6577176
|
||||
public class PmidCleaningRule {
|
||||
|
||||
public static final Pattern PATTERN = Pattern.compile("0*(\\d{1,8})");
|
||||
|
||||
public static String clean(String pmid) {
|
||||
String s = pmid
|
||||
.toLowerCase()
|
||||
.replaceAll("\\s", "");
|
||||
|
||||
final Matcher m = PATTERN.matcher(s);
|
||||
|
||||
if (m.find()) {
|
||||
return m.group(1);
|
||||
}
|
||||
return "";
|
||||
.replaceAll("\\s", "")
|
||||
.trim()
|
||||
.replaceAll("^0+", "");
|
||||
return s.matches("^\\d{1,8}$") ? s : "";
|
||||
}
|
||||
|
||||
}
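
Review note on `PmidCleaningRule`: the `PATTERN`-based variant in this hunk strips leading zeros through a capturing group instead of a separate `replaceAll("^0+", "")`. The stand-alone sketch below mirrors that variant. `PmidSketch` is a hypothetical class name, and the sample inputs in the closing note are illustrative rather than taken from the repository tests.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone sketch mirroring the PATTERN-based variant shown in the diff.
class PmidSketch {

    // Optional leading zeros, then 1 to 8 digits captured as group 1.
    private static final Pattern PATTERN = Pattern.compile("0*(\\d{1,8})");

    static String clean(String pmid) {
        String s = pmid.toLowerCase().replaceAll("\\s", "");
        Matcher m = PATTERN.matcher(s);
        // find() locates the first match; group(1) excludes the leading zeros.
        return m.find() ? m.group(1) : "";
    }
}
```

Because `find()` is not anchored, a nine-digit input such as `123456789` yields its first eight digits, whereas the variant that checks `s.matches("^\\d{1,8}$")` returns an empty string for the same input.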
|
||||
|
|
|
@ -7,21 +7,12 @@ import java.util.regex.Pattern;
|
|||
// https://ror.readme.io/docs/ror-identifier-pattern
|
||||
public class RorCleaningRule {
|
||||
|
||||
public static final String ROR_PREFIX = "https://ror.org/";
|
||||
|
||||
private static final Pattern PATTERN = Pattern.compile("(?<ror>0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2})");
|
||||
|
||||
public static String clean(String ror) {
|
||||
String s = ror
|
||||
.replaceAll("\\s", "")
|
||||
.toLowerCase();
|
||||
|
||||
Matcher m = PATTERN.matcher(s);
|
||||
|
||||
if (m.find()) {
|
||||
return ROR_PREFIX + m.group("ror");
|
||||
}
|
||||
return "";
|
||||
Matcher m = Pattern.compile("0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2}").matcher(s);
|
||||
return m.matches() ? "https://ror.org/" + m.group() : "";
|
||||
}
|
||||
|
||||
}
|
||||
|
|
|
@ -18,9 +18,9 @@
|
|||
"paramRequired": true
|
||||
},
|
||||
{
|
||||
"paramName": "fi",
|
||||
"paramLongName": "filterInvisible",
|
||||
"paramDescription": "if true filters out invisible entities",
|
||||
"paramName": "c",
|
||||
"paramLongName": "graphTableClassName",
|
||||
"paramDescription": "the graph entity class name",
|
||||
"paramRequired": true
|
||||
}
|
||||
]
|
|
@ -50,13 +50,10 @@ object ScholixUtils extends Serializable {
|
|||
}
|
||||
}
|
||||
|
||||
def extractRelationDate(summary: ScholixSummary): String = {
|
||||
def extractRelationDate(summary: ScholixResource): String = {
|
||||
|
||||
summary.getPublicationDate
|
||||
|
||||
if (summary.getDate == null || summary.getDate.isEmpty)
|
||||
null
|
||||
else {
|
||||
summary.getDate.get(0)
|
||||
}
|
||||
}
|
||||
|
||||
def inverseRelationShip(rel: ScholixRelationship): ScholixRelationship = {
|
||||
|
@ -144,11 +141,7 @@ object ScholixUtils extends Serializable {
|
|||
s.setRelationship(inverseRelationShip(scholix.getRelationship))
|
||||
s.setSource(scholix.getTarget)
|
||||
s.setTarget(scholix.getSource)
|
||||
s.setIdentifier(
|
||||
DHPUtils.md5(
|
||||
s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
|
||||
)
|
||||
)
|
||||
updateId(s)
|
||||
s
|
||||
|
||||
}
|
||||
|
@ -187,6 +180,21 @@ object ScholixUtils extends Serializable {
|
|||
} else List()
|
||||
}
|
||||
|
||||
def updateId(scholix: Scholix): Scholix = {
|
||||
scholix.setIdentifier(
|
||||
generateIdentifier(
|
||||
scholix.getSource.getDnetIdentifier,
|
||||
scholix.getTarget.getDnetIdentifier,
|
||||
scholix.getRelationship.getName
|
||||
)
|
||||
)
|
||||
scholix
|
||||
}
|
||||
|
||||
def generateIdentifier(sourceId: String, targetId: String, relation: String): String = {
|
||||
DHPUtils.md5(s"$sourceId::$relation::$targetId")
|
||||
}
|
||||
|
||||
def generateCompleteScholix(scholix: Scholix, target: ScholixSummary): Scholix = {
|
||||
val s = new Scholix
|
||||
s.setPublicationDate(scholix.getPublicationDate)
|
||||
|
@ -195,11 +203,7 @@ object ScholixUtils extends Serializable {
|
|||
s.setRelationship(scholix.getRelationship)
|
||||
s.setSource(scholix.getSource)
|
||||
s.setTarget(generateScholixResourceFromSummary(target))
|
||||
s.setIdentifier(
|
||||
DHPUtils.md5(
|
||||
s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
|
||||
)
|
||||
)
|
||||
updateId(s)
|
||||
s
|
||||
}
|
||||
|
||||
|
@ -211,11 +215,7 @@ object ScholixUtils extends Serializable {
|
|||
s.setRelationship(scholix.getRelationship)
|
||||
s.setSource(scholix.getSource)
|
||||
s.setTarget(target)
|
||||
s.setIdentifier(
|
||||
DHPUtils.md5(
|
||||
s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
|
||||
)
|
||||
)
|
||||
updateId(s)
|
||||
s
|
||||
}
|
||||
|
||||
|
@ -232,7 +232,7 @@ object ScholixUtils extends Serializable {
|
|||
|
||||
if (summaryObject.getAuthor != null && !summaryObject.getAuthor.isEmpty) {
|
||||
val l: List[ScholixEntityId] =
|
||||
summaryObject.getAuthor.asScala.map(a => new ScholixEntityId(a, null)).toList
|
||||
summaryObject.getAuthor.asScala.map(a => new ScholixEntityId(a, null)).take(100).toList
|
||||
if (l.nonEmpty)
|
||||
r.setCreator(l.asJava)
|
||||
}
|
||||
|
@ -241,7 +241,7 @@ object ScholixUtils extends Serializable {
|
|||
r.setPublicationDate(summaryObject.getDate.get(0))
|
||||
if (summaryObject.getPublisher != null && !summaryObject.getPublisher.isEmpty) {
|
||||
val plist: List[ScholixEntityId] =
|
||||
summaryObject.getPublisher.asScala.map(p => new ScholixEntityId(p, null)).toList
|
||||
summaryObject.getPublisher.asScala.map(p => new ScholixEntityId(p, null)).take(100).toList
|
||||
|
||||
if (plist.nonEmpty)
|
||||
r.setPublisher(plist.asJava)
|
||||
|
@ -260,6 +260,7 @@ object ScholixUtils extends Serializable {
|
|||
"complete"
|
||||
)
|
||||
)
|
||||
.take(100)
|
||||
.toList
|
||||
|
||||
if (l.nonEmpty)
|
||||
|
@ -269,38 +270,38 @@ object ScholixUtils extends Serializable {
|
|||
r
|
||||
}
|
||||
|
||||
// def scholixFromSource(relation: Relation, source: ScholixResource): Scholix = {
|
||||
// if (relation == null || source == null)
|
||||
// return null
|
||||
// val s = new Scholix
|
||||
// var l: List[ScholixEntityId] = extractCollectedFrom(relation)
|
||||
// if (l.isEmpty)
|
||||
// l = extractCollectedFrom(source)
|
||||
// if (l.isEmpty)
|
||||
// return null
|
||||
// s.setLinkprovider(l.asJava)
|
||||
// var d = extractRelationDate(relation)
|
||||
// if (d == null)
|
||||
// d = source.getPublicationDate
|
||||
//
|
||||
// s.setPublicationDate(d)
|
||||
//
|
||||
// if (source.getPublisher != null && !source.getPublisher.isEmpty) {
|
||||
// s.setPublisher(source.getPublisher)
|
||||
// }
|
||||
//
|
||||
// val semanticRelation = relations.getOrElse(relation.getRelClass.toLowerCase, null)
|
||||
// if (semanticRelation == null)
|
||||
// return null
|
||||
// s.setRelationship(
|
||||
// new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
|
||||
// )
|
||||
// s.setSource(source)
|
||||
//
|
||||
// s
|
||||
// }
|
||||
|
||||
def scholixFromSource(relation: Relation, source: ScholixResource): Scholix = {
|
||||
if (relation == null || source == null)
|
||||
return null
|
||||
val s = new Scholix
|
||||
var l: List[ScholixEntityId] = extractCollectedFrom(relation)
|
||||
if (l.isEmpty)
|
||||
l = extractCollectedFrom(source)
|
||||
if (l.isEmpty)
|
||||
return null
|
||||
s.setLinkprovider(l.asJava)
|
||||
var d = extractRelationDate(relation)
|
||||
if (d == null)
|
||||
d = source.getPublicationDate
|
||||
|
||||
s.setPublicationDate(d)
|
||||
|
||||
if (source.getPublisher != null && !source.getPublisher.isEmpty) {
|
||||
s.setPublisher(source.getPublisher)
|
||||
}
|
||||
|
||||
val semanticRelation = relations.getOrElse(relation.getRelClass.toLowerCase, null)
|
||||
if (semanticRelation == null)
|
||||
return null
|
||||
s.setRelationship(
|
||||
new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
|
||||
)
|
||||
s.setSource(source)
|
||||
|
||||
s
|
||||
}
|
||||
|
||||
def scholixFromSource(relation: Relation, source: ScholixSummary): Scholix = {
|
||||
|
||||
if (relation == null || source == null)
|
||||
return null
|
||||
|
@ -322,11 +323,8 @@ object ScholixUtils extends Serializable {
|
|||
s.setPublicationDate(d)
|
||||
|
||||
if (source.getPublisher != null && !source.getPublisher.isEmpty) {
|
||||
val l: List[ScholixEntityId] = source.getPublisher.asScala
|
||||
.map { p =>
|
||||
new ScholixEntityId(p, null)
|
||||
}(collection.breakOut)
|
||||
|
||||
source.getPublisher
|
||||
val l: List[ScholixEntityId] = source.getPublisher.asScala.toList
|
||||
if (l.nonEmpty)
|
||||
s.setPublisher(l.asJava)
|
||||
}
|
||||
|
@ -337,7 +335,7 @@ object ScholixUtils extends Serializable {
|
|||
s.setRelationship(
|
||||
new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
|
||||
)
|
||||
s.setSource(generateScholixResourceFromSummary(source))
|
||||
s.setSource(source)
|
||||
|
||||
s
|
||||
}
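
Review note on `ScholixUtils.generateIdentifier`: the identifier is the MD5 of `sourceId::relation::targetId`, so the parameter order (source, target, relation) differs from the order in which the parts are concatenated. The Java sketch below shows the same construction with the JDK's `MessageDigest`. `ScholixIdSketch` is a hypothetical class, and the assumption that `DHPUtils.md5` returns a lowercase 32-character hex string is not confirmed by this diff.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; DHPUtils.md5 may encode its result differently.
class ScholixIdSketch {

    // MD5 of the UTF-8 bytes, rendered as 32 lowercase hex characters.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Same argument order as in the diff: source, target, relation.
    static String generateIdentifier(String sourceId, String targetId, String relation) {
        return md5Hex(sourceId + "::" + relation + "::" + targetId);
    }
}
```

Since the source comes first in the hashed string, swapping source and target produces a different identifier, which is why the inverse Scholix record needs its identifier recomputed via `updateId`.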
|
||||
|
|
|
@ -15,7 +15,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;
|
|||
|
||||
public class MdStoreClientTest {
|
||||
|
||||
// @Test
|
||||
@Test
|
||||
public void testMongoCollection() throws IOException {
|
||||
final MdstoreClient client = new MdstoreClient("mongodb://localhost:27017", "mdstore");
|
||||
|
||||
|
|
|
@ -1,18 +0,0 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class GridCleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
assertEquals("grid.493784.5", GridCleaningRule.clean("grid.493784.5"));
|
||||
assertEquals("grid.493784.5x", GridCleaningRule.clean("grid.493784.5x"));
|
||||
assertEquals("grid.493784.5x", GridCleaningRule.clean("493784.5x"));
|
||||
assertEquals("", GridCleaningRule.clean("493x784.5x"));
|
||||
}
|
||||
|
||||
}
|
|
@ -1,19 +0,0 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class ISNICleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
assertEquals("0000000463436020", ISNICleaningRule.clean("0000 0004 6343 6020"));
|
||||
assertEquals("0000000463436020", ISNICleaningRule.clean("0000000463436020"));
|
||||
assertEquals("", ISNICleaningRule.clean("Q30256598"));
|
||||
assertEquals("0000000493403529", ISNICleaningRule.clean("ISNI:0000000493403529"));
|
||||
assertEquals("000000008614884X", ISNICleaningRule.clean("0000 0000 8614 884X"));
|
||||
}
|
||||
|
||||
}
|
|
@ -1,19 +0,0 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class PICCleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
assertEquals("887624982", PICCleaningRule.clean("887624982"));
|
||||
assertEquals("", PICCleaningRule.clean("887 624982"));
|
||||
assertEquals("887624982", PICCleaningRule.clean(" 887624982 "));
|
||||
assertEquals("887624982", PICCleaningRule.clean(" 887624982x "));
|
||||
assertEquals("887624982", PICCleaningRule.clean(" 88762498200 "));
|
||||
}
|
||||
|
||||
}
|
|
@ -1,19 +0,0 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class PmcCleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
assertEquals("PMC1234", PmcCleaningRule.clean("PMC1234"));
|
||||
assertEquals("PMC1234", PmcCleaningRule.clean(" PMC1234"));
|
||||
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC12345678"));
|
||||
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC123456789"));
|
||||
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC 12345678"));
|
||||
}
|
||||
|
||||
}
|
|
@ -1,24 +0,0 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class PmidCleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
// leading zeros are removed
|
||||
assertEquals("1234", PmidCleaningRule.clean("01234"));
|
||||
// tolerant to spaces in the middle
|
||||
assertEquals("1234567", PmidCleaningRule.clean("0123 4567"));
|
||||
// stop parsing at the first non-numeric char
|
||||
assertEquals("123", PmidCleaningRule.clean("0123x4567"));
|
||||
// invalid id leading to empty result
|
||||
assertEquals("", PmidCleaningRule.clean("abc"));
|
||||
// valid id with zeroes in the number
|
||||
assertEquals("20794075", PmidCleaningRule.clean("20794075"));
|
||||
}
|
||||
|
||||
}
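Review note: the assertions in `PmidCleaningRuleTest` pin down the expected cleaning behaviour fairly tightly (spaces tolerated, parsing stops at the first non-digit, leading zeros dropped, empty result for invalid input). The actual `PmidCleaningRule` implementation is not part of this diff, so the following is only a sketch of one way to satisfy those five assertions. The class name `PmidCleanSketch` and the strip-then-scan approach are assumptions.

```java
// Hypothetical stand-in for PmidCleaningRule, written only from the test expectations.
final class PmidCleanSketch {

    private PmidCleanSketch() {
    }

    static String clean(String pmid) {
        if (pmid == null) {
            return "";
        }
        // tolerate whitespace anywhere in the value
        String compact = pmid.replaceAll("\\s+", "");
        // keep only the leading run of ASCII digits
        int end = 0;
        while (end < compact.length()
            && compact.charAt(end) >= '0'
            && compact.charAt(end) <= '9') {
            end++;
        }
        // remove leading zeros; an all-zero or non-numeric value becomes ""
        return compact.substring(0, end).replaceFirst("^0+", "");
    }
}
```

Under these assumptions, `PmidCleanSketch.clean("0123x4567")` yields `"123"`, matching the corresponding assertion in the deleted test.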
|
|
@ -1,17 +0,0 @@
|
|||
|
||||
package eu.dnetlib.dhp.schema.oaf.utils;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
class RorCleaningRuleTest {
|
||||
|
||||
@Test
|
||||
void testCleaning() {
|
||||
assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("https://ror.org/05rpz9w55"));
|
||||
assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("05rpz9w55"));
|
||||
assertEquals("", RorCleaningRule.clean("05rpz9w_55"));
|
||||
}
|
||||
|
||||
}
|
|
@ -7,7 +7,7 @@
|
|||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
<artifactId>dhp</artifactId>
|
||||
<version>1.2.5-SNAPSHOT</version>
|
||||
<relativePath>../pom.xml</relativePath>
|
||||
<relativePath>../pom.xml</relativePath>
|
||||
</parent>
|
||||
|
||||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
|
|
|
@ -16,9 +16,8 @@ public class NGramUtils extends AbstractPaceFunctions {
|
|||
.loadFromClasspath("/eu/dnetlib/pace/config/stopwords_en.txt");
|
||||
|
||||
public static String cleanupForOrdering(String s) {
|
||||
return (NGRAMUTILS.filterStopWords(NGRAMUTILS.normalize(s), stopwords) + StringUtils.repeat(" ", SIZE))
|
||||
.substring(0, SIZE)
|
||||
.replaceAll(" ", "");
|
||||
String result = NGRAMUTILS.filterStopWords(NGRAMUTILS.normalize(s), stopwords);
|
||||
return result.isEmpty() ? result : result.replace(" ", "");
|
||||
}
|
||||
|
||||
}
|
||||
|
|
|
@ -2,6 +2,7 @@
|
|||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Collection;
|
||||
import java.util.HashMap;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
|
|
|
@ -13,19 +13,7 @@ import eu.dnetlib.pace.config.Config;
|
|||
public class SortedNgramPairs extends NgramPairs {
|
||||
|
||||
public SortedNgramPairs(Map<String, Integer> params) {
|
||||
super(params, false);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Collection<String> doApply(Config conf, String s) {
|
||||
|
||||
final List<String> tokens = Lists.newArrayList(Splitter.on(" ").omitEmptyStrings().trimResults().split(s));
|
||||
|
||||
Collections.sort(tokens);
|
||||
|
||||
return ngramPairs(
|
||||
Lists.newArrayList(getNgrams(Joiner.on(" ").join(tokens), param("ngramLen"), param("max") * 2, 1, 2)),
|
||||
param("max"));
|
||||
super(params, true);
|
||||
}
|
||||
|
||||
}
|
||||
|
|
|
@ -49,18 +49,18 @@ public abstract class AbstractPaceFunctions {
|
|||
protected static Set<String> ngramBlacklist = loadFromClasspath("/eu/dnetlib/pace/config/ngram_blacklist.txt");
|
||||
|
||||
// html regex for normalization
|
||||
public static final Pattern HTML_REGEX = Pattern.compile("<[^>]*>");
|
||||
public final Pattern HTML_REGEX = Pattern.compile("<[^>]*>");
|
||||
|
||||
private static final String alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ";
|
||||
private static final String aliases_from = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎àáâäæãåāèéêëēėęəîïíīįìôöòóœøōõûüùúūßśšłžźżçćčñń";
|
||||
private static final String aliases_to = "0123456789+-=()n0123456789+-=()aaaaaaaaeeeeeeeeiiiiiioooooooouuuuussslzzzcccnn";
|
||||
|
||||
// doi prefix for normalization
|
||||
public static final Pattern DOI_PREFIX = Pattern.compile("(https?:\\/\\/dx\\.doi\\.org\\/)|(doi:)");
|
||||
public final Pattern DOI_PREFIX = Pattern.compile("(https?:\\/\\/dx\\.doi\\.org\\/)|(doi:)");
|
||||
|
||||
private static Pattern numberPattern = Pattern.compile("-?\\d+(\\.\\d+)?");
|
||||
private Pattern numberPattern = Pattern.compile("-?\\d+(\\.\\d+)?");
|
||||
|
||||
private static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
|
||||
private Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
|
||||
|
||||
protected String concat(final List<String> l) {
|
||||
return Joiner.on(" ").skipNulls().join(l);
|
||||
|
@ -130,12 +130,10 @@ public abstract class AbstractPaceFunctions {
|
|||
|
||||
protected static String fixAliases(final String s) {
|
||||
final StringBuilder sb = new StringBuilder();
|
||||
|
||||
s.chars().forEach(ch -> {
|
||||
for (final char ch : Lists.charactersOf(s)) {
|
||||
final int i = StringUtils.indexOf(aliases_from, ch);
|
||||
sb.append(i >= 0 ? aliases_to.charAt(i) : (char) ch);
|
||||
});
|
||||
|
||||
sb.append(i >= 0 ? aliases_to.charAt(i) : ch);
|
||||
}
|
||||
return sb.toString();
|
||||
}
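Review note: both variants of `fixAliases` in this hunk do a positional lookup, where a character found at index `i` of `aliases_from` is replaced by the character at index `i` of `aliases_to`. The `(char) ch` cast in the `s.chars()` variant matters because `String.chars()` yields `int` values, and appending an `int` to a `StringBuilder` would write its decimal form. A minimal sketch of the lookup, using a shortened and purely illustrative alias table (`AliasSketch`, `FROM` and `TO` are hypothetical names):

```java
// Illustrative only: a 9-character alias table instead of the full one in AbstractPaceFunctions.
final class AliasSketch {

    // à á â ä è é ê ë ß, written as escapes so the source stays encoding-independent
    private static final String FROM = "\u00e0\u00e1\u00e2\u00e4\u00e8\u00e9\u00ea\u00eb\u00df";
    // replacement at the same index as the aliased character
    private static final String TO = "aaaaeeees";

    private AliasSketch() {
    }

    static String fixAliases(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int k = 0; k < s.length(); k++) {
            char ch = s.charAt(k);
            int i = FROM.indexOf(ch);
            // both branches are char, so append(char) is selected
            sb.append(i >= 0 ? TO.charAt(i) : ch);
        }
        return sb.toString();
    }
}
```

The two tables must stay the same length and in the same order, otherwise the lookup silently maps to the wrong replacement.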
|
||||
|
||||
|
@ -150,10 +148,9 @@ public abstract class AbstractPaceFunctions {
|
|||
protected String removeSymbols(final String s) {
|
||||
final StringBuilder sb = new StringBuilder();
|
||||
|
||||
s.chars().forEach(ch -> {
|
||||
sb.append(StringUtils.contains(alpha, ch) ? (char) ch : ' ');
|
||||
});
|
||||
|
||||
for (final char ch : Lists.charactersOf(s)) {
|
||||
sb.append(StringUtils.contains(alpha, ch) ? ch : " ");
|
||||
}
|
||||
return sb.toString().replaceAll("\\s+", " ");
|
||||
}
|
||||
|
||||
|
@ -237,8 +234,7 @@ public abstract class AbstractPaceFunctions {
|
|||
|
||||
final Set<String> h = Sets.newHashSet();
|
||||
try {
|
||||
for (final String s : IOUtils
|
||||
.readLines(NGramUtils.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
|
||||
for (final String s : IOUtils.readLines(NGramUtils.class.getResourceAsStream(classpath))) {
|
||||
h.add(fixAliases(transliterator.transliterate(s))); // transliteration of the stopwords
|
||||
}
|
||||
} catch (final Throwable e) {
|
||||
|
@ -253,8 +249,7 @@ public abstract class AbstractPaceFunctions {
|
|||
|
||||
final Map<String, String> m = new HashMap<>();
|
||||
try {
|
||||
for (final String s : IOUtils
|
||||
.readLines(AbstractPaceFunctions.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
|
||||
for (final String s : IOUtils.readLines(AbstractPaceFunctions.class.getResourceAsStream(classpath))) {
|
||||
// string is like this: code;word1;word2;word3
|
||||
String[] line = s.split(";");
|
||||
String value = line[0];
|
||||
|
@ -347,7 +342,7 @@ public abstract class AbstractPaceFunctions {
|
|||
public static <T> String readFromClasspath(final String filename, final Class<T> clazz) {
|
||||
final StringWriter sw = new StringWriter();
|
||||
try {
|
||||
IOUtils.copy(clazz.getResourceAsStream(filename), sw, StandardCharsets.UTF_8);
|
||||
IOUtils.copy(clazz.getResourceAsStream(filename), sw);
|
||||
return sw.toString();
|
||||
} catch (final IOException e) {
|
||||
throw new RuntimeException("cannot load resource from classpath: " + filename);
|
||||
|
|
|
@ -4,6 +4,7 @@ package eu.dnetlib.pace.config;
|
|||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.function.Predicate;
|
||||
import java.util.regex.Pattern;
|
||||
|
||||
import eu.dnetlib.pace.model.ClusteringDef;
|
||||
import eu.dnetlib.pace.model.FieldDef;
|
||||
|
|
|
@ -13,8 +13,7 @@ import eu.dnetlib.pace.clustering.NGramUtils;
|
|||
public class RowDataOrderingComparator implements Comparator<Row> {
|
||||
|
||||
/** The comparator field. */
|
||||
private final int comparatorField;
|
||||
private final int identityFieldPosition;
|
||||
private int comparatorField;
|
||||
|
||||
/**
|
||||
* Instantiates a new map document comparator.
|
||||
|
@ -22,9 +21,8 @@ public class RowDataOrderingComparator implements Comparator<Row> {
|
|||
* @param comparatorField
|
||||
* the comparator field
|
||||
*/
|
||||
public RowDataOrderingComparator(final int comparatorField, int identityFieldPosition) {
|
||||
public RowDataOrderingComparator(final int comparatorField) {
|
||||
this.comparatorField = comparatorField;
|
||||
this.identityFieldPosition = identityFieldPosition;
|
||||
}
|
||||
|
||||
/*
|
||||
|
@ -53,10 +51,7 @@ public class RowDataOrderingComparator implements Comparator<Row> {
|
|||
|
||||
int res = to1.compareTo(to2);
|
||||
if (res == 0) {
|
||||
res = o1.compareTo(o2);
|
||||
if (res == 0) {
|
||||
return d1.getString(identityFieldPosition).compareTo(d2.getString(identityFieldPosition));
|
||||
}
|
||||
return o1.compareTo(o2);
|
||||
}
|
||||
|
||||
return res;
|
||||
|
|
|
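Review note: one side of the `RowDataOrderingComparator` hunk above falls back to comparing the identifier field when the ordering values are equal. A final tie-break on a unique key makes the sort order total, so two runs over the same rows produce the same sequence. The sketch below shows the idea with plain `String[]` rows instead of Spark `Row` objects; `TieBreakSketch` and the `{orderingValue, identifier}` layout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration: each row is {orderingValue, identifier}.
final class TieBreakSketch {

    // compare on the ordering value first, then on the identifier
    static final Comparator<String[]> BY_ORDER_THEN_ID =
        Comparator.<String[], String>comparing(r -> r[0]).thenComparing(r -> r[1]);

    private TieBreakSketch() {
    }

    // returns the identifiers in sorted order, leaving the input untouched
    static List<String> sortedIds(List<String[]> rows) {
        List<String[]> copy = new ArrayList<>(rows);
        copy.sort(BY_ORDER_THEN_ID);
        List<String> ids = new ArrayList<>();
        for (String[] r : copy) {
            ids.add(r[1]);
        }
        return ids;
    }
}
```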
@ -0,0 +1,644 @@
|
|||
package eu.dnetlib.pace.model
|
||||
|
||||
import com.jayway.jsonpath.{Configuration, JsonPath, Option}
|
||||
import eu.dnetlib.pace.config.{DedupConfig, Type}
|
||||
import eu.dnetlib.pace.tree.support.TreeProcessor
|
||||
import eu.dnetlib.pace.util.MapDocumentUtil.truncateValue
|
||||
import eu.dnetlib.pace.util.{BlockProcessor, MapDocumentUtil, SparkReporter}
|
||||
import org.apache.spark.SparkContext
|
||||
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
|
||||
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
|
||||
import org.apache.spark.sql.{Column, Dataset, Encoder, Encoders, Row, functions}
|
||||
import org.apache.spark.sql.catalyst.expressions.{GenericRowWithSchema, Literal}
|
||||
import org.apache.spark.sql.expressions.{Aggregator, MutableAggregationBuffer, UserDefinedAggregateFunction, UserDefinedFunction, Window}
|
||||
import org.apache.spark.sql.types.{ArrayType, DataType, DataTypes, Metadata, StructField, StructType}
|
||||
|
||||
import java.util
|
||||
import java.util.function.Predicate
|
||||
import java.util.regex.Pattern
|
||||
import scala.collection.JavaConverters._
|
||||
import scala.collection.mutable
|
||||
import org.apache.spark.sql.functions.{col, lit, udf}
|
||||
|
||||
import java.util.Collections
|
||||
import java.util.stream.Collectors
|
||||
|
||||
case class SparkDedupConfig(conf: DedupConfig, numPartitions: Int) extends Serializable {
|
||||
|
||||
private val URL_REGEX: Pattern = Pattern.compile("^\\s*(http|https|ftp)\\://.*")
|
||||
|
||||
private val CONCAT_REGEX: Pattern = Pattern.compile("\\|\\|\\|")
|
||||
|
||||
private val urlFilter = (s: String) => URL_REGEX.matcher(s).matches
|
||||
|
||||
val modelExtractor: (Dataset[String] => Dataset[Row]) = df => {
|
||||
|
||||
df.withColumn("mapDocument", rowFromJsonUDF.apply(df.col(df.columns(0))))
|
||||
.withColumn("identifier", new Column("mapDocument.identifier"))
|
||||
//.repartition(new Column("identifier"))
|
||||
.dropDuplicates("identifier")
|
||||
.select("mapDocument.*")
|
||||
|
||||
df.map(r => rowFromJson(r))(RowEncoder(rowDataType))
|
||||
.dropDuplicates("identifier")
|
||||
}
|
||||
|
||||
val generateClusters: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
|
||||
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
|
||||
if (conf.blacklists.containsKey(fdef.getName)) {
|
||||
res.withColumn(
|
||||
fdef.getName + "_filtered",
|
||||
filterColumnUDF(fdef).apply(new Column(fdef.getName))
|
||||
)
|
||||
} else {
|
||||
res
|
||||
}
|
||||
})
|
||||
|
||||
val df_with_keys = conf
|
||||
.clusterings()
|
||||
.asScala
|
||||
.foldLeft(df_with_filters)((res, cd) => {
|
||||
res.withColumn(
|
||||
cd.getName + "_clustered",
|
||||
functions.explode_outer(
|
||||
clusterValuesUDF(cd).apply(
|
||||
functions.array(
|
||||
cd.getFields.asScala
|
||||
.map(f => res.col(if (conf.blacklists.containsKey(f)) f.concat("_filtered") else f)): _*
|
||||
)
|
||||
)
|
||||
)
|
||||
)
|
||||
})
|
||||
|
||||
// filter blacklisted values
|
||||
// create one column per cluster prefix
|
||||
|
||||
// GROUPING sets approach
|
||||
val tempTable = this.getClass.getSimpleName + "__generateClusters";
|
||||
|
||||
df_with_keys.createOrReplaceTempView(this.getClass.getSimpleName + "__generateClusters")
|
||||
|
||||
val keys = conf.clusterings().asScala.map(_.getName + "_clustered").mkString(",")
|
||||
val fields = rowDataType.fieldNames.mkString(",")
|
||||
|
||||
// Using SQL because GROUPING SETS are not available through Scala/Java DSL
|
||||
df_with_keys.sqlContext.sql(
|
||||
("SELECT coalesce(" + keys + ") as key, sort_array(collect_sort_slice(" + fields + ")) as block FROM " + tempTable + " WHERE coalesce(" + keys + ") IS NOT NULL GROUP BY GROUPING SETS (" + keys + ") HAVING size(block) > 1")
|
||||
)
|
||||
|
||||
}
|
||||
|
||||
val generateClustersWithDFAPI: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
|
||||
System.out.println(conf.getWf.getEntityType + "::" +conf.getWf.getSubEntityType)
|
||||
|
||||
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
|
||||
if (conf.blacklists.containsKey(fdef.getName)) {
|
||||
res.withColumn(
|
||||
fdef.getName + "_filtered",
|
||||
filterColumnUDF(fdef).apply(new Column(fdef.getName))
|
||||
)
|
||||
} else {
|
||||
res
|
||||
}
|
||||
})
|
||||
|
||||
var relBlocks: Dataset[Row] = null
|
||||
|
||||
import scala.collection.JavaConversions._
|
||||
|
||||
for (cd <- conf.clusterings()) {
|
||||
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
|
||||
|
||||
for (fName <- cd.getFields()) {
|
||||
if (conf.blacklists.containsKey(fName))
|
||||
columns.add(new Column(fName + "_filtered"))
|
||||
else
|
||||
columns.add(new Column(fName))
|
||||
}
|
||||
|
||||
val tmp: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
|
||||
|
||||
|
||||
|
||||
|
||||
/*.select((Seq(rowDataType.fieldNames: _*) ++ Seq("key")).map(col): _*)
|
||||
.groupByKey(r => r.getAs[String]("key"))(Encoders.STRING)
|
||||
.agg(collectSortSliceAggregator.toColumn)
|
||||
.toDF("key", "block")
|
||||
.select(col("block.block").as("block"))*/
|
||||
|
||||
System.out.println(cd.getName)
|
||||
|
||||
val ds = tmp.groupBy("key")
|
||||
// .agg(functions.sort_array(collectSortSliceUDAF(rowDataType.fieldNames.map(col): _*)).as("block"))
|
||||
.agg(functions.collect_set(functions.struct(rowDataType.fieldNames.map(col): _*)).as("block"))
|
||||
//.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
|
||||
|
||||
|
||||
//df_with_filters.printSchema()
|
||||
//ds.printSchema()
|
||||
|
||||
if (relBlocks == null) relBlocks = ds
|
||||
else relBlocks = relBlocks.union(ds)
|
||||
}
|
||||
|
||||
// System.out.println()
|
||||
|
||||
relBlocks
|
||||
}
|
||||
|
||||
val generateClustersWithWindows: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
|
||||
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
|
||||
if (conf.blacklists.containsKey(fdef.getName)) {
|
||||
res.withColumn(
|
||||
fdef.getName + "_filtered",
|
||||
filterColumnUDF(fdef).apply(new Column(fdef.getName))
|
||||
)
|
||||
} else {
|
||||
res
|
||||
}
|
||||
})
|
||||
|
||||
var relBlocks: Dataset[Row] = null
|
||||
|
||||
import scala.collection.JavaConversions._
|
||||
|
||||
for (cd <- conf.clusterings()) {
|
||||
System.out.println(conf.getWf.getEntityType + "::" + conf.getWf.getSubEntityType+ ": " + cd.getName + " " + cd.toString)
|
||||
|
||||
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
|
||||
|
||||
for (fName <- cd.getFields()) {
|
||||
if (conf.blacklists.containsKey(fName))
|
||||
columns.add(new Column(fName + "_filtered"))
|
||||
else
|
||||
columns.add(new Column(fName))
|
||||
}
|
||||
|
||||
// Add 'key' column with the value generated by the given clustering definition
|
||||
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
|
||||
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
|
||||
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(conf.getWf.getOrderField))))
|
||||
// filter out rows with position exceeding the maxqueuesize parameter
|
||||
.filter(col("position").leq(conf.getWf.getQueueMaxSize))
|
||||
.groupBy("key")
|
||||
.agg(functions.collect_set(functions.struct(rowDataType.fieldNames.map(col): _*)).as("block"))
|
||||
.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
|
||||
|
||||
if (relBlocks == null) relBlocks = ds
|
||||
else relBlocks = relBlocks.union(ds)
|
||||
}
|
||||
|
||||
relBlocks
|
||||
}
|
||||
|
||||
val generateClustersWithDFAPIMerged: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
|
||||
if (conf.blacklists.containsKey(fdef.getName)) {
|
||||
res.withColumn(
|
||||
fdef.getName + "_filtered",
|
||||
filterColumnUDF(fdef).apply(new Column(fdef.getName))
|
||||
)
|
||||
} else {
|
||||
res
|
||||
}
|
||||
})
|
||||
|
||||
import scala.collection.JavaConversions._
|
||||
|
||||
val keys = conf.clusterings().foldLeft(null : Column)((res, cd) => {
|
||||
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
|
||||
|
||||
for (fName <- cd.getFields()) {
|
||||
if (conf.blacklists.containsKey(fName))
|
||||
columns.add(new Column(fName + "_filtered"))
|
||||
else
|
||||
columns.add(new Column(fName))
|
||||
}
|
||||
|
||||
if (res != null)
|
||||
functions.array_union(res, clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*)))
|
||||
else
|
||||
clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))
|
||||
})
|
||||
|
||||
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(keys))
|
||||
.select((Seq(rowDataType.fieldNames: _*) ++ Seq("key")).map(col): _*)
|
||||
.groupByKey(r => r.getAs[String]("key"))(Encoders.STRING)
|
||||
.agg(collectSortSliceAggregator.toColumn)
|
||||
.toDF("key", "block")
|
||||
.select(col("block.block").as("block"))
|
||||
|
||||
/*.groupBy("key")
|
||||
.agg(collectSortSliceUDAF(rowDataType.fieldNames.map(col): _*).as("block"))*/
|
||||
.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
|
||||
|
||||
ds
|
||||
}
|
||||
|
||||
val generateClustersWithRDDReduction: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
|
||||
if (conf.blacklists.containsKey(fdef.getName)) {
|
||||
res.withColumn(
|
||||
fdef.getName + "_filtered",
|
||||
filterColumnUDF(fdef).apply(new Column(fdef.getName))
|
||||
)
|
||||
} else {
|
||||
res
|
||||
}
|
||||
})
|
||||
|
||||
var relBlocks: Dataset[Row] = null
|
||||
|
||||
import scala.collection.JavaConversions._
|
||||
|
||||
for (cd <- conf.clusterings()) {
|
||||
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
|
||||
|
||||
for (fName <- cd.getFields()) {
|
||||
if (conf.blacklists.containsKey(fName))
|
||||
columns.add(new Column(fName + "_filtered"))
|
||||
else
|
||||
columns.add(new Column(fName))
|
||||
}
|
||||
|
||||
val ds: Dataset[Row] = df.sparkSession.createDataFrame(df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
|
||||
.select(col("key"), functions.array(functions.struct(rowDataType.fieldNames.map(col): _*).as("value")))
|
||||
.rdd.keyBy(_.getString(0))
|
||||
.reduceByKey((a, b) => {
|
||||
val b1 = a.getSeq[Row](1)
|
||||
val b2 = b.getSeq[Row](1)
|
||||
|
||||
if (b1.size + b2.size > conf.getWf.getQueueMaxSize)
|
||||
Row(a.get(0), b1.union(b2).sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize))
|
||||
else
|
||||
Row(a.get(0), b1.union(b2))
|
||||
})
|
||||
.map(_._2)
|
||||
.filter(k => k.getSeq(1).size > 1),
|
||||
new StructType().add(StructField("key", DataTypes.StringType)).add(StructField("block", ArrayType(rowDataType)))
|
||||
)
|
||||
|
||||
if (relBlocks == null) relBlocks = ds
|
||||
else relBlocks = relBlocks.union(ds)
|
||||
}
|
||||
|
||||
relBlocks
|
||||
}
|
||||
|
||||
val printAnalytics: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
|
||||
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
|
||||
if (conf.blacklists.containsKey(fdef.getName)) {
|
||||
res.withColumn(
|
||||
fdef.getName + "_filtered",
|
||||
filterColumnUDF(fdef).apply(new Column(fdef.getName))
|
||||
)
|
||||
} else {
|
||||
res
|
||||
}
|
||||
})
|
||||
|
||||
var relBlocks: Dataset[Row] = null
|
||||
|
||||
import scala.collection.JavaConversions._
|
||||
|
||||
for (cd <- conf.clusterings()) {
|
||||
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
|
||||
|
||||
for (fName <- cd.getFields()) {
|
||||
if (conf.blacklists.containsKey(fName))
|
||||
columns.add(new Column(fName + "_filtered"))
|
||||
else
|
||||
columns.add(new Column(fName))
|
||||
}
|
||||
|
||||
// Add 'key' column with the value generated by the given clustering definition
|
||||
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
|
||||
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
|
||||
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(conf.getWf.getOrderField)))
|
||||
// filter out rows with position exceeding the maxqueuesize parameter
|
||||
.filter(col("position").lt(conf.getWf.getQueueMaxSize))
|
||||
|
||||
|
||||
// inner join to compute all combinations of rows to compare
|
||||
// note the condition on position to obtain 'windowing': given a row this is compared at most with the next
|
||||
// SlidingWindowSize rows following the sort order
|
||||
val dsWithMatch = ds.as("l").join(ds.as("r"),
|
||||
col("l.key").equalTo(col("r.key")),
|
||||
"inner"
|
||||
)
|
||||
.filter((col("l.position").lt(col("r.position")))
|
||||
&& (col("r.position").lt(col("l.position").plus(lit(conf.getWf.getSlidingWindowSize)))))
|
||||
// Add match column with the result of comparison
|
||||
// dsWithMatch.show(false)
|
||||
|
||||
if (relBlocks == null)
|
||||
relBlocks = dsWithMatch
|
||||
else
|
||||
relBlocks = relBlocks.union(dsWithMatch)
|
||||
}
|
||||
|
||||
System.out.println(conf.getWf.getEntityType + "::" + conf.getWf.getSubEntityType)
|
||||
System.out.println("Total number of comparisons: " + relBlocks.count())
|
||||
|
||||
|
||||
df
|
||||
}
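Review note: the self-join in `printAnalytics` keeps a pair only when `l.position < r.position < l.position + slidingWindowSize`. Because both inequalities are strict, each row is paired with at most `slidingWindowSize - 1` rows that follow it in sort order. The sketch below enumerates the same pairs over plain indices, which may help when estimating the number of comparisons for a block. `SlidingWindowSketch` is a hypothetical helper, not part of the PR, and it uses 0-based indices where `row_number()` is 1-based (the pair count is unaffected).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring the join filter on "position".
final class SlidingWindowSketch {

    private SlidingWindowSketch() {
    }

    /** Returns index pairs {l, r} such that l < r < l + windowSize. */
    static List<int[]> pairs(int blockSize, int windowSize) {
        List<int[]> out = new ArrayList<>();
        for (int l = 0; l < blockSize; l++) {
            // r stops at the end of the block or at the window edge, whichever comes first
            for (int r = l + 1; r < blockSize && r < l + windowSize; r++) {
                out.add(new int[] { l, r });
            }
        }
        return out;
    }
}
```

For a block of 4 rows and a window size of 3 this produces 5 pairs, compared with the 6 pairs of an unrestricted all-against-all comparison.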
|
||||
|
||||
|
||||
val generateAndProcessClustersWithJoins: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
|
||||
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
|
||||
if (conf.blacklists.containsKey(fdef.getName)) {
|
||||
res.withColumn(
|
||||
fdef.getName + "_filtered",
|
||||
filterColumnUDF(fdef).apply(new Column(fdef.getName))
|
||||
)
|
||||
} else {
|
||||
res
|
||||
}
|
||||
})
|
||||
|
||||
var relBlocks: Dataset[Row] = null
|
||||
|
||||
import scala.collection.JavaConversions._
|
||||
|
||||
for (cd <- conf.clusterings()) {
|
||||
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
|
||||
|
||||
for (fName <- cd.getFields()) {
|
||||
if (conf.blacklists.containsKey(fName))
|
||||
columns.add(new Column(fName + "_filtered"))
|
||||
else
|
||||
columns.add(new Column(fName))
|
||||
}
|
||||
|
||||
// Add 'key' column with the value generated by the given clustering definition
|
||||
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
|
||||
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
|
||||
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(conf.getWf.getOrderField)))
|
||||
// filter out rows with position exceeding the maxqueuesize parameter
|
||||
.filter(col("position").lt(conf.getWf.getQueueMaxSize))
|
||||
|
||||
|
||||
// inner join to compute all combinations of rows to compare
|
||||
// note the condition on position to obtain 'windowing': given a row this is compared at most with the next
|
||||
// SlidingWindowSize rows following the sort order
|
||||
val dsWithMatch = ds.as("l").join(ds.as("r"),
|
||||
col("l.key").equalTo(col("r.key")),
|
||||
"inner"
|
||||
)
|
||||
.filter((col("l.position").lt(col("r.position")))
|
||||
&& (col("r.position").lt(col("l.position").plus(lit(conf.getWf.getSlidingWindowSize)))))
|
||||
// Add match column with the result of comparison
|
||||
.withColumn("match", udf[Boolean, Row, Row]((a, b) => {
|
||||
val treeProcessor = new TreeProcessor(conf)
|
||||
|
||||
treeProcessor.compare(a, b)
|
||||
}).apply(functions.struct(rowDataType.fieldNames.map(s => col("l.".concat(s))): _*), functions.struct(rowDataType.fieldNames.map(s => col("r.".concat(s))): _*)))
|
||||
.filter(col("match").equalTo(true))
|
||||
.select(col("l.identifier").as("from"), col("r.identifier").as("to"))
|
||||
|
||||
// dsWithMatch.show(false)
|
||||
|
||||
if (relBlocks == null)
|
||||
relBlocks = dsWithMatch
|
||||
else
|
||||
relBlocks = relBlocks.union(dsWithMatch)
|
||||
}
|
||||
|
||||
val res = relBlocks
|
||||
//.select(col("l.identifier").as("from"), col("r.identifier").as("to"))
|
||||
//.repartition()
|
||||
.distinct()
|
||||
|
||||
// res.show(false)
|
||||
res.select(functions.struct("from", "to"))
|
||||
}
|
||||
|
||||
val processClusters: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
|
||||
val entity = conf.getWf.getEntityType
|
||||
|
||||
df.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
|
||||
.withColumn("relations", processBlock(df.sqlContext.sparkContext).apply(new Column("block")))
|
||||
.select(functions.explode(new Column("relations")).as("relation"))
|
||||
//.repartition(new Column("relation"))
|
||||
.dropDuplicates("relation")
|
||||
}
|
||||
|
||||
val rowDataType: StructType = {
|
||||
// val unordered = conf.getPace.getModel.asScala.foldLeft(
|
||||
// new StructType()
|
||||
// )((resType, fdef) => {
|
||||
// resType.add(fdef.getType match {
|
||||
// case Type.List | Type.JSON =>
|
||||
// StructField(fdef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
|
||||
// case Type.DoubleArray =>
|
||||
// StructField(fdef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
|
||||
// case _ =>
|
||||
// StructField(fdef.getName, DataTypes.StringType, true, Metadata.empty)
|
||||
// })
|
||||
// })
|
||||
//
|
||||
// conf.getPace.getModel.asScala.filterNot(_.getName.equals(conf.getWf.getOrderField)).foldLeft(
|
||||
// new StructType()
|
||||
// .add(unordered(conf.getWf.getOrderField))
|
||||
// .add(StructField("identifier", DataTypes.StringType, false, Metadata.empty))
|
||||
// )((resType, fdef) => resType.add(unordered(fdef.getName)))
|
||||
|
||||
val identifier = new FieldDef()
|
||||
identifier.setName("identifier")
|
||||
identifier.setType(Type.String)
|
||||
|
||||
(conf.getPace.getModel.asScala ++ Seq(identifier)).sortBy(_.getName)
|
||||
.foldLeft(
|
||||
new StructType()
|
||||
)((resType, fdef) => {
|
||||
resType.add(fdef.getType match {
|
||||
case Type.List | Type.JSON =>
|
||||
StructField(fdef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
|
||||
case Type.DoubleArray =>
|
||||
StructField(fdef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
|
||||
case _ =>
|
||||
StructField(fdef.getName, DataTypes.StringType, true, Metadata.empty)
|
||||
})
|
||||
})
|
||||
|
||||
|
||||
|
||||
}
|
||||
|
||||
val identityFieldPosition: Int = rowDataType.fieldIndex("identifier")
|
||||
|
||||
val orderingFieldPosition: Int = rowDataType.fieldIndex(conf.getWf.getOrderField)
|
||||
|
||||
def rowFromJson(json: String) : Row = {
|
||||
val documentContext =
|
||||
JsonPath.using(Configuration.defaultConfiguration.addOptions(Option.SUPPRESS_EXCEPTIONS)).parse(json)
|
||||
val values = new Array[Any](rowDataType.size)
|
||||
|
||||
values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
|
||||
|
||||
rowDataType.fieldNames.zipWithIndex.foldLeft(values) {
|
||||
case ((res, (fname, index))) => {
|
||||
val fdef = conf.getPace.getModelMap.get(fname)
|
||||
|
||||
if (fdef != null) {
|
||||
res(index) = fdef.getType match {
|
||||
case Type.String | Type.Int =>
|
||||
MapDocumentUtil.truncateValue(
|
||||
MapDocumentUtil.getJPathString(fdef.getPath, documentContext),
|
||||
fdef.getLength
|
||||
)
|
||||
|
||||
case Type.URL =>
|
||||
var uv = MapDocumentUtil.getJPathString(fdef.getPath, documentContext)
|
||||
if (!urlFilter(uv)) uv = ""
|
||||
uv
|
||||
|
||||
case Type.List | Type.JSON =>
|
||||
MapDocumentUtil.truncateList(
|
||||
MapDocumentUtil.getJPathList(fdef.getPath, documentContext, fdef.getType),
|
||||
fdef.getSize
|
||||
).toArray
|
||||
|
||||
case Type.StringConcat =>
|
||||
val jpaths = CONCAT_REGEX.split(fdef.getPath)
|
||||
|
||||
truncateValue(
|
||||
jpaths
|
||||
.map(jpath => MapDocumentUtil.getJPathString(jpath, documentContext))
|
||||
.mkString(" "),
|
||||
fdef.getLength
|
||||
)
|
||||
|
||||
case Type.DoubleArray =>
|
||||
MapDocumentUtil.getJPathArray(fdef.getPath, json)
|
||||
}
|
||||
}
|
||||
|
||||
res
|
||||
}
|
||||
}
|
||||
|
||||
new GenericRowWithSchema(values, rowDataType)
|
||||
}
|
||||
|
||||
val rowFromJsonUDF = udf(rowFromJson(_), rowDataType)
|
||||
|
||||
def filterColumnUDF(fdef: FieldDef): UserDefinedFunction = {
|
||||
|
||||
val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
|
||||
|
||||
if (blacklist == null) {
|
||||
throw new IllegalArgumentException("Column: " + fdef.getName + " does not have any filter")
|
||||
} else {
|
||||
fdef.getType match {
|
||||
case Type.List | Type.JSON =>
|
||||
udf[Array[String], Array[String]](values => {
|
||||
values.filter((v: String) => !blacklist.test(v))
|
||||
})
|
||||
|
||||
case _ =>
|
||||
udf[String, String](v => {
|
||||
if (blacklist.test(v)) ""
|
||||
else v
|
||||
})
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
def clusterValuesUDF(cd: ClusteringDef) = {
|
||||
udf[mutable.WrappedArray[String], mutable.WrappedArray[Object]](values => {
|
||||
values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala).map(cd.getName.concat(_))
|
||||
})
|
||||
}
|
||||
|
||||
def processBlock(implicit sc: SparkContext) = {
|
||||
val accumulators = SparkReporter.constructAccumulator(conf, sc)
|
||||
|
||||
udf[Array[Tuple2[String, String]], mutable.WrappedArray[Row]](block => {
|
||||
val reporter = new SparkReporter(accumulators)
|
||||
|
||||
val mapDocuments = block.asJava.stream
|
||||
.sorted(new RowDataOrderingComparator(orderingFieldPosition))
|
||||
.limit(conf.getWf.getQueueMaxSize)
|
||||
.collect(Collectors.toList[Row]())
|
||||
|
||||
new BlockProcessor(conf, identityFieldPosition, orderingFieldPosition).processSortedRows(mapDocuments, reporter)
|
||||
|
||||
reporter.getRelations.asScala.toArray
|
||||
}).asNondeterministic()
|
||||
}
|
||||
|
||||
val collectSortSliceAggregator : Aggregator[Row,Seq[Row], Row] = new Aggregator[Row, Seq[Row], Row] () {
|
||||
override def zero: Seq[Row] = Seq[Row]()
|
||||
|
||||
|
||||
override def reduce(buffer: Seq[Row], input: Row): Seq[Row] = {
|
||||
merge(buffer, Seq(input))
|
||||
}
|
||||
|
||||
override def merge(buffer: Seq[Row], toMerge: Seq[Row]): Seq[Row] = {
|
||||
val newBlock = buffer ++ toMerge
|
||||
|
||||
if (newBlock.size > conf.getWf.getQueueMaxSize)
|
||||
newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
|
||||
else
|
||||
newBlock
|
||||
}
|
||||
|
||||
override def finish(reduction: Seq[Row]): Row = {
|
||||
Row(reduction.toArray)
|
||||
}
|
||||
|
||||
override def bufferEncoder: Encoder[Seq[Row]] = Encoders.kryo[Seq[Row]]
|
||||
|
||||
override def outputEncoder: Encoder[Row] = RowEncoder.apply(new StructType().add("block", DataTypes.createArrayType(rowDataType), nullable = true))
|
||||
}
|
||||
|
||||
val collectSortSliceUDAF : UserDefinedAggregateFunction = new UserDefinedAggregateFunction {
|
||||
override def inputSchema: StructType = rowDataType
|
||||
|
||||
override def bufferSchema: StructType = {
|
||||
new StructType().add("block", DataTypes.createArrayType(rowDataType), nullable = true)
|
||||
}
|
||||
|
||||
override def dataType: DataType = DataTypes.createArrayType(rowDataType)
|
||||
|
||||
override def deterministic: Boolean = true
|
||||
|
||||
override def initialize(buffer: MutableAggregationBuffer): Unit = {
|
||||
buffer(0) = Seq[Row]()
|
||||
}
|
||||
|
||||
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
|
||||
val newBlock = buffer.getSeq[Row](0) ++ Seq(input)
|
||||
|
||||
if (newBlock.size > conf.getWf.getQueueMaxSize)
|
||||
buffer(0) = newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
|
||||
else
|
||||
buffer(0) = newBlock
|
||||
}
|
||||
|
||||
override def merge(buffer: MutableAggregationBuffer, row: Row): Unit = {
|
||||
val newBlock = buffer.getSeq[Row](0) ++ row.getSeq[Row](0)
|
||||
|
||||
if (newBlock.size > conf.getWf.getQueueMaxSize)
|
||||
buffer(0) = newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
|
||||
else
|
||||
buffer(0) = newBlock
|
||||
}
|
||||
|
||||
override def evaluate(buffer: Row): Any = {
|
||||
buffer.getSeq[Row](0)
|
||||
}
|
||||
}
|
||||
}
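Review note on the `merge` logic of `collectSortSliceAggregator` above: the buffer is only sorted and cut when it grows past the queue limit, so a block that never exceeds the limit keeps its arrival order. A minimal plain-Java sketch of that behaviour follows. It is not the project's code: `BoundedBlockMerge` is a hypothetical name, and plain strings stand in for the ordering field of each `Row`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the aggregator buffer; strings play the role of the ordering field.
final class BoundedBlockMerge {

    // Concatenate the two buffers; only when the result exceeds maxSize,
    // sort it and keep the first maxSize entries.
    static List<String> merge(List<String> buffer, List<String> toMerge, int maxSize) {
        List<String> merged = new ArrayList<>(buffer);
        merged.addAll(toMerge);
        if (merged.size() > maxSize) {
            merged.sort(Comparator.naturalOrder());
            return new ArrayList<>(merged.subList(0, maxSize));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Exceeds the limit: sorted, then truncated.
        System.out.println(merge(Arrays.asList("c", "a"), Arrays.asList("b"), 2));
        // Within the limit: returned in arrival order, unsorted.
        System.out.println(merge(Arrays.asList("c", "a"), Arrays.asList(), 2));
    }
}
```

Because truncation keeps the smallest entries under the ordering, repeated merges converge on the same top `maxSize` elements regardless of how partitions are combined.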
|
|
@ -1,131 +0,0 @@
|
|||
package eu.dnetlib.pace.model
|
||||
|
||||
import eu.dnetlib.pace.config.{DedupConfig, Type}
|
||||
import eu.dnetlib.pace.util.{BlockProcessor, SparkReporter}
|
||||
import org.apache.spark.SparkContext
|
||||
import org.apache.spark.sql.catalyst.expressions.Literal
|
||||
import org.apache.spark.sql.expressions._
|
||||
import org.apache.spark.sql.functions.{col, lit, udf}
|
||||
import org.apache.spark.sql.types._
|
||||
import org.apache.spark.sql.{Column, Dataset, Row, functions}
|
||||
|
||||
import java.util.function.Predicate
|
||||
import java.util.stream.Collectors
|
||||
import scala.collection.JavaConversions._
|
||||
import scala.collection.JavaConverters._
|
||||
import scala.collection.mutable
|
||||
case class SparkDeduper(conf: DedupConfig) extends Serializable {
|
||||
|
||||
val model: SparkModel = SparkModel(conf)
|
||||
|
||||
val dedup: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
df.transform(filterAndCleanup)
|
||||
.transform(generateClustersWithCollect)
|
||||
.transform(processBlocks)
|
||||
}
|
||||
|
||||
|
||||
val filterAndCleanup: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
|
||||
if (conf.blacklists.containsKey(fdef.getName)) {
|
||||
res.withColumn(
|
||||
fdef.getName + "_filtered",
|
||||
filterColumnUDF(fdef).apply(new Column(fdef.getName))
|
||||
)
|
||||
} else {
|
||||
res
|
||||
}
|
||||
})
|
||||
|
||||
df_with_filters
|
||||
}
|
||||
|
||||
def filterColumnUDF(fdef: FieldDef): UserDefinedFunction = {
|
||||
val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
|
||||
|
||||
if (blacklist == null) {
|
||||
throw new IllegalArgumentException("Column: " + fdef.getName + " does not have any filter")
|
||||
} else {
|
||||
fdef.getType match {
|
||||
case Type.List | Type.JSON =>
|
||||
udf[Array[String], Array[String]](values => {
|
||||
values.filter((v: String) => !blacklist.test(v))
|
||||
})
|
||||
|
||||
case _ =>
|
||||
udf[String, String](v => {
|
||||
if (blacklist.test(v)) ""
|
||||
else v
|
||||
})
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
val generateClustersWithCollect: (Dataset[Row] => Dataset[Row]) = df_with_filters => {
|
||||
var df_with_clustering_keys: Dataset[Row] = null
|
||||
|
||||
for ((cd, idx) <- conf.clusterings().zipWithIndex) {
|
||||
val inputColumns = cd.getFields().foldLeft(Seq[Column]())((acc, fName) => {
|
||||
val column = if (conf.blacklists.containsKey(fName))
|
||||
Seq(col(fName + "_filtered"))
|
||||
else
|
||||
Seq(col(fName))
|
||||
|
||||
acc ++ column
|
||||
})
|
||||
|
||||
// Add 'key' column with the value generated by the given clustering definition
|
||||
val ds: Dataset[Row] = df_with_filters
|
||||
.withColumn("clustering", lit(cd.getName + "::" + idx))
|
||||
.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(inputColumns: _*))))
|
||||
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
|
||||
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName))))
|
||||
|
||||
if (df_with_clustering_keys == null)
|
||||
df_with_clustering_keys = ds
|
||||
else
|
||||
df_with_clustering_keys = df_with_clustering_keys.union(ds)
|
||||
}
|
||||
|
||||
//TODO: analytics
|
||||
|
||||
val df_with_blocks = df_with_clustering_keys
|
||||
// filter out rows with position exceeding the maxqueuesize parameter
|
||||
.filter(col("position").leq(conf.getWf.getQueueMaxSize))
|
||||
.groupBy("clustering", "key")
|
||||
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
|
||||
.filter(functions.size(new Column("block")).gt(1))
|
||||
|
||||
df_with_blocks
|
||||
}
|
||||
|
||||
def clusterValuesUDF(cd: ClusteringDef) = {
|
||||
udf[mutable.WrappedArray[String], mutable.WrappedArray[Any]](values => {
|
||||
values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala)
|
||||
})
|
||||
}
|
||||
|
||||
val processBlocks: (Dataset[Row] => Dataset[Row]) = df => {
|
||||
df.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
|
||||
.withColumn("relations", processBlock(df.sqlContext.sparkContext).apply(new Column("block")))
|
||||
.select(functions.explode(new Column("relations")).as("relation"))
|
||||
}
|
||||
|
||||
def processBlock(implicit sc: SparkContext) = {
|
||||
val accumulators = SparkReporter.constructAccumulator(conf, sc)
|
||||
|
||||
udf[Array[(String, String)], mutable.WrappedArray[Row]](block => {
|
||||
val reporter = new SparkReporter(accumulators)
|
||||
|
||||
val mapDocuments = block.asJava.stream()
|
||||
.sorted(new RowDataOrderingComparator(model.orderingFieldPosition, model.identityFieldPosition))
|
||||
.limit(conf.getWf.getQueueMaxSize)
|
||||
.collect(Collectors.toList[Row]())
|
||||
|
||||
new BlockProcessor(conf, model.identityFieldPosition, model.orderingFieldPosition).processSortedRows(mapDocuments, reporter)
|
||||
|
||||
reporter.getRelations.asScala.toArray
|
||||
}).asNondeterministic()
|
||||
}
|
||||
|
||||
}
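Review note on `generateClustersWithCollect` above: rows are ranked within each clustering key, rows ranked beyond `queueMaxSize` are dropped, the survivors are collected as a set, and blocks with fewer than two members are discarded. The sketch below mimics those steps on in-memory data. It is an illustration under assumptions: `BlockingSketch` is a hypothetical name, each input pair is `{key, identifier}`, and ordering is by identifier only (the case where no order field is configured).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory analogue of the window/filter/groupBy pipeline.
final class BlockingSketch {

    static Map<String, List<String>> blocks(List<String[]> keyedIds, int queueMaxSize) {
        // Group identifiers by clustering key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : keyedIds) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            List<String> block = e.getValue().stream()
                .sorted()              // rank rows within the key
                .limit(queueMaxSize)   // drop rows whose position exceeds the limit
                .distinct()            // set semantics, as with collect_set
                .collect(Collectors.toList());
            if (block.size() > 1) {    // a block needs at least two members to be compared
                result.put(e.getKey(), block);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> input = Arrays.asList(
            new String[] {"k1", "b"}, new String[] {"k1", "a"},
            new String[] {"k1", "c"}, new String[] {"k2", "z"});
        System.out.println(blocks(input, 2));
    }
}
```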
|
|
@ -1,108 +0,0 @@
|
|||
package eu.dnetlib.pace.model
|
||||
|
||||
import com.jayway.jsonpath.{Configuration, JsonPath}
|
||||
import eu.dnetlib.pace.config.{DedupConfig, Type}
|
||||
import eu.dnetlib.pace.util.MapDocumentUtil
|
||||
import org.apache.spark.sql.catalyst.encoders.RowEncoder
|
||||
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
|
||||
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
|
||||
import org.apache.spark.sql.{Dataset, Row}
|
||||
|
||||
import java.util.regex.Pattern
|
||||
import scala.collection.JavaConverters._
|
||||
|
||||
case class SparkModel(conf: DedupConfig) {
|
||||
private val URL_REGEX: Pattern = Pattern.compile("^\\s*(http|https|ftp)\\://.*")
|
||||
|
||||
private val CONCAT_REGEX: Pattern = Pattern.compile("\\|\\|\\|")
|
||||
|
||||
val identifierFieldName = "identifier"
|
||||
|
||||
val orderingFieldName = if (!conf.getWf.getOrderField.isEmpty) conf.getWf.getOrderField else identifierFieldName
|
||||
|
||||
val schema: StructType = {
|
||||
// create an implicit identifier field
|
||||
val identifier = new FieldDef()
|
||||
identifier.setName(identifierFieldName)
|
||||
identifier.setType(Type.String)
|
||||
|
||||
// Construct a Spark StructType representing the schema of the model
|
||||
(Seq(identifier) ++ conf.getPace.getModel.asScala)
|
||||
.foldLeft(
|
||||
new StructType()
|
||||
)((resType, fieldDef) => {
|
||||
resType.add(fieldDef.getType match {
|
||||
case Type.List | Type.JSON =>
|
||||
StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
|
||||
case Type.DoubleArray =>
|
||||
StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
|
||||
case _ =>
|
||||
StructField(fieldDef.getName, DataTypes.StringType, true, Metadata.empty)
|
||||
})
|
||||
})
|
||||
|
||||
|
||||
}
|
||||
|
||||
val identityFieldPosition: Int = schema.fieldIndex(identifierFieldName)
|
||||
|
||||
val orderingFieldPosition: Int = schema.fieldIndex(orderingFieldName)
|
||||
|
||||
val parseJsonDataset: (Dataset[String] => Dataset[Row]) = df => {
|
||||
df.map(r => rowFromJson(r))(RowEncoder(schema))
|
||||
}
|
||||
|
||||
def rowFromJson(json: String): Row = {
|
||||
val documentContext =
|
||||
JsonPath.using(Configuration.defaultConfiguration.addOptions(com.jayway.jsonpath.Option.SUPPRESS_EXCEPTIONS)).parse(json)
|
||||
val values = new Array[Any](schema.size)
|
||||
|
||||
values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
|
||||
|
||||
schema.fieldNames.zipWithIndex.foldLeft(values) {
|
||||
case ((res, (fname, index))) => {
|
||||
val fdef = conf.getPace.getModelMap.get(fname)
|
||||
|
||||
if (fdef != null) {
|
||||
res(index) = fdef.getType match {
|
||||
case Type.String | Type.Int =>
|
||||
MapDocumentUtil.truncateValue(
|
||||
MapDocumentUtil.getJPathString(fdef.getPath, documentContext),
|
||||
fdef.getLength
|
||||
)
|
||||
|
||||
case Type.URL =>
|
||||
var uv = MapDocumentUtil.getJPathString(fdef.getPath, documentContext)
|
||||
if (!URL_REGEX.matcher(uv).matches)
|
||||
uv = ""
|
||||
uv
|
||||
|
||||
case Type.List | Type.JSON =>
|
||||
MapDocumentUtil.truncateList(
|
||||
MapDocumentUtil.getJPathList(fdef.getPath, documentContext, fdef.getType),
|
||||
fdef.getSize
|
||||
).asScala
|
||||
|
||||
case Type.StringConcat =>
|
||||
val jpaths = CONCAT_REGEX.split(fdef.getPath)
|
||||
|
||||
MapDocumentUtil.truncateValue(
|
||||
jpaths
|
||||
.map(jpath => MapDocumentUtil.getJPathString(jpath, documentContext))
|
||||
.mkString(" "),
|
||||
fdef.getLength
|
||||
)
|
||||
|
||||
case Type.DoubleArray =>
|
||||
MapDocumentUtil.getJPathArray(fdef.getPath, json)
|
||||
}
|
||||
}
|
||||
|
||||
res
|
||||
}
|
||||
}
|
||||
|
||||
new GenericRowWithSchema(values, schema)
|
||||
}
|
||||
}
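Review note on the two patterns declared at the top of `SparkModel`: `URL_REGEX` blanks out values that do not start with an `http`, `https` or `ftp` scheme, and `CONCAT_REGEX` splits a `StringConcat` path on the literal `|||` separator. A small Java sketch of both, reusing the same pattern strings. The class name, method names and the sample paths are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper reusing the two pattern strings from the hunk above.
final class FieldParsingSketch {

    private static final Pattern URL_REGEX = Pattern.compile("^\\s*(http|https|ftp)\\://.*");
    private static final Pattern CONCAT_REGEX = Pattern.compile("\\|\\|\\|");

    // Keep the value only if it looks like a URL, otherwise return an empty string.
    static String urlOrEmpty(String value) {
        return URL_REGEX.matcher(value).matches() ? value : "";
    }

    // Split a concatenated path definition into its individual JSON paths.
    static String[] splitPaths(String path) {
        return CONCAT_REGEX.split(path);
    }

    public static void main(String[] args) {
        System.out.println("[" + urlOrEmpty(" https://example.org") + "]");
        System.out.println("[" + urlOrEmpty("example.org") + "]");
        System.out.println(Arrays.toString(splitPaths("$.title|||$.subtitle")));
    }
}
```

Note that `matches()` on a `null` value would throw, so a caller in the real code path may want a null check before applying the filter.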
|
||||
|
|
@ -1,8 +1,11 @@
|
|||
|
||||
package eu.dnetlib.pace.tree;
|
||||
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import com.google.common.base.Joiner;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
|
|
@ -5,6 +5,7 @@ import java.util.Map;
|
|||
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.tree.support.AbstractComparator;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
|
|
|
@ -4,6 +4,7 @@ package eu.dnetlib.pace.tree;
|
|||
import java.util.Map;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractComparator;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
|
|
|
@ -4,6 +4,7 @@ package eu.dnetlib.pace.tree;
|
|||
import java.util.Map;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractComparator;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
|
|
|
@ -4,6 +4,8 @@ package eu.dnetlib.pace.tree;
|
|||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import com.google.common.collect.Lists;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractListComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
|
|
@ -4,6 +4,7 @@ package eu.dnetlib.pace.tree;
|
|||
import java.util.Map;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.tree.support.AbstractComparator;
|
||||
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
|
||||
import eu.dnetlib.pace.tree.support.ComparatorClass;
|
||||
|
||||
|
@ -43,25 +44,22 @@ public class StringContainsMatch extends AbstractStringComparator {
|
|||
STRING = STRING.toLowerCase();
|
||||
}
|
||||
|
||||
if (AGGREGATOR != null) {
|
||||
switch (AGGREGATOR) {
|
||||
case "AND":
|
||||
if (ca.contains(STRING) && cb.contains(STRING))
|
||||
return 1.0;
|
||||
break;
|
||||
case "OR":
|
||||
if (ca.contains(STRING) || cb.contains(STRING))
|
||||
return 1.0;
|
||||
break;
|
||||
case "XOR":
|
||||
if (ca.contains(STRING) ^ cb.contains(STRING))
|
||||
return 1.0;
|
||||
break;
|
||||
default:
|
||||
return 0.0;
|
||||
}
|
||||
switch (AGGREGATOR) {
|
||||
case "AND":
|
||||
if (ca.contains(STRING) && cb.contains(STRING))
|
||||
return 1.0;
|
||||
break;
|
||||
case "OR":
|
||||
if (ca.contains(STRING) || cb.contains(STRING))
|
||||
return 1.0;
|
||||
break;
|
||||
case "XOR":
|
||||
if (ca.contains(STRING) ^ cb.contains(STRING))
|
||||
return 1.0;
|
||||
break;
|
||||
default:
|
||||
return 0.0;
|
||||
}
|
||||
|
||||
return 0.0;
|
||||
}
|
||||
}
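Review note on the `StringContainsMatch` hunk above: the comparator returns `1.0` when the configured string is contained in the two inputs according to the aggregator (`AND`, `OR`, `XOR`) and `0.0` otherwise. The following standalone sketch reproduces that switch. It keeps the null guard shown on one side of the hunk, and it assumes that both inputs are lowercased when the comparison is case-insensitive; the class and parameter names are hypothetical.

```java
// Hypothetical standalone version of the aggregator switch.
final class ContainsAggregatorSketch {

    static double distance(String a, String b, String needle, String aggregator, boolean caseSensitive) {
        String ca = caseSensitive ? a : a.toLowerCase();
        String cb = caseSensitive ? b : b.toLowerCase();
        String s = caseSensitive ? needle : needle.toLowerCase();

        if (aggregator == null) {
            return 0.0; // guard: a missing aggregator never matches
        }
        switch (aggregator) {
            case "AND":
                return (ca.contains(s) && cb.contains(s)) ? 1.0 : 0.0;
            case "OR":
                return (ca.contains(s) || cb.contains(s)) ? 1.0 : 0.0;
            case "XOR":
                return (ca.contains(s) ^ cb.contains(s)) ? 1.0 : 0.0;
            default:
                return 0.0;
        }
    }

    public static void main(String[] args) {
        System.out.println(distance("openorgs", "openorgs", "openorgs", "XOR", false));
        System.out.println(distance("openorgs", "openorgs", "openorgs", "AND", false));
    }
}
```

Without the null guard, `switch` on a `null` string throws a `NullPointerException`, which is worth keeping in mind when the parameter key is renamed in the tests.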
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
|
||||
package eu.dnetlib.pace.tree.support;
|
||||
|
||||
import java.util.Collections;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
|
|
|
@ -4,9 +4,11 @@ package eu.dnetlib.pace.tree.support;
|
|||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import com.google.common.collect.Lists;
|
||||
import com.wcohen.ss.AbstractStringDistance;
|
||||
|
||||
import eu.dnetlib.pace.config.Config;
|
||||
import eu.dnetlib.pace.config.Type;
|
||||
|
||||
abstract public class AbstractListComparator extends AbstractComparator<List<String>> {
|
||||
protected AbstractListComparator(Map<String, String> params) {
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
|
||||
package eu.dnetlib.pace.tree.support;
|
||||
|
||||
import java.util.AbstractList;
|
||||
import java.util.Collections;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
|
|
@ -2,6 +2,8 @@
|
|||
package eu.dnetlib.pace.util;
|
||||
|
||||
import java.util.ArrayList;
|
||||
import java.util.Collection;
|
||||
import java.util.Iterator;
|
||||
import java.util.List;
|
||||
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
|
@ -11,6 +13,7 @@ import org.apache.spark.sql.Row;
|
|||
import org.apache.spark.sql.types.ArrayType;
|
||||
import org.apache.spark.sql.types.DataType;
|
||||
import org.apache.spark.sql.types.StringType;
|
||||
import org.apache.spark.sql.types.StructType;
|
||||
|
||||
import eu.dnetlib.pace.config.DedupConfig;
|
||||
import eu.dnetlib.pace.config.WfConfig;
|
||||
|
|
|
@ -18,7 +18,6 @@ package eu.dnetlib.pace.util;
|
|||
* See the License for the specific language governing permissions and
|
||||
* limitations under the License.
|
||||
*/
|
||||
|
||||
/*
|
||||
* Diff Match and Patch
|
||||
* Copyright 2018 The diff-match-patch Authors.
|
||||
|
|
|
@ -2,20 +2,20 @@
|
|||
package eu.dnetlib.pace.util;
|
||||
|
||||
import java.math.BigDecimal;
|
||||
import java.util.ArrayList;
|
||||
import java.util.LinkedHashMap;
|
||||
import java.util.List;
|
||||
import java.util.concurrent.ConcurrentHashMap;
|
||||
import java.util.*;
|
||||
import java.util.function.Predicate;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import com.fasterxml.jackson.core.JsonProcessingException;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
import com.jayway.jsonpath.Configuration;
|
||||
import com.jayway.jsonpath.DocumentContext;
|
||||
import com.jayway.jsonpath.JsonPath;
|
||||
import com.jayway.jsonpath.spi.cache.Cache;
|
||||
import com.jayway.jsonpath.spi.cache.CacheProvider;
|
||||
import com.jayway.jsonpath.Option;
|
||||
|
||||
import eu.dnetlib.pace.config.DedupConfig;
|
||||
import eu.dnetlib.pace.config.Type;
|
||||
import eu.dnetlib.pace.model.*;
|
||||
import net.minidev.json.JSONArray;
|
||||
|
||||
public class MapDocumentUtil {
|
||||
|
@ -23,20 +23,47 @@ public class MapDocumentUtil {
|
|||
public static final String URL_REGEX = "^(http|https|ftp)\\://.*";
|
||||
public static Predicate<String> urlFilter = s -> s.trim().matches(URL_REGEX);
|
||||
|
||||
static {
|
||||
CacheProvider.setCache(new Cache() {
|
||||
private final ConcurrentHashMap<String, JsonPath> jsonPathCache = new ConcurrentHashMap();
|
||||
public static List<String> getJPathList(String path, String json, Type type) {
|
||||
if (type == Type.List)
|
||||
return JsonPath
|
||||
.using(
|
||||
Configuration
|
||||
.defaultConfiguration()
|
||||
.addOptions(Option.ALWAYS_RETURN_LIST, Option.SUPPRESS_EXCEPTIONS))
|
||||
.parse(json)
|
||||
.read(path);
|
||||
Object jresult;
|
||||
List<String> result = new ArrayList<>();
|
||||
try {
|
||||
jresult = JsonPath.read(json, path);
|
||||
} catch (Throwable e) {
|
||||
return result;
|
||||
}
|
||||
if (jresult instanceof JSONArray) {
|
||||
|
||||
@Override
|
||||
public JsonPath get(String key) {
|
||||
return jsonPathCache.get(key);
|
||||
}
|
||||
((JSONArray) jresult).forEach(it -> {
|
||||
|
||||
try {
|
||||
result.add(new ObjectMapper().writeValueAsString(it));
|
||||
} catch (JsonProcessingException e) {
|
||||
|
||||
}
|
||||
});
|
||||
return result;
|
||||
}
|
||||
|
||||
if (jresult instanceof LinkedHashMap) {
|
||||
try {
|
||||
result.add(new ObjectMapper().writeValueAsString(jresult));
|
||||
} catch (JsonProcessingException e) {
|
||||
|
||||
@Override
|
||||
public void put(String key, JsonPath value) {
|
||||
jsonPathCache.put(key, value);
|
||||
}
|
||||
});
|
||||
return result;
|
||||
}
|
||||
if (jresult instanceof String) {
|
||||
result.add((String) jresult);
|
||||
}
|
||||
return result;
|
||||
}
|
||||
|
||||
public static String getJPathString(final String jsonPath, final String json) {
|
||||
|
@ -117,11 +144,6 @@ public class MapDocumentUtil {
|
|||
return result;
|
||||
}
|
||||
|
||||
if (type == Type.List && jresult instanceof List) {
|
||||
((List<?>) jresult).forEach(x -> result.add(x.toString()));
|
||||
return result;
|
||||
}
|
||||
|
||||
if (jresult instanceof JSONArray) {
|
||||
((JSONArray) jresult).forEach(it -> {
|
||||
try {
|
||||
|
|
|
@ -10,6 +10,7 @@ import org.apache.spark.SparkContext;
|
|||
import org.apache.spark.util.LongAccumulator;
|
||||
|
||||
import eu.dnetlib.pace.config.DedupConfig;
|
||||
import eu.dnetlib.pace.util.Reporter;
|
||||
import scala.Serializable;
|
||||
import scala.Tuple2;
|
||||
|
||||
|
|
|
@ -2,12 +2,14 @@
|
|||
package eu.dnetlib.pace.clustering;
|
||||
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import org.junit.jupiter.api.BeforeAll;
|
||||
import org.junit.jupiter.api.Test;
|
||||
import org.junit.jupiter.api.*;
|
||||
|
||||
import com.google.common.collect.Lists;
|
||||
import com.google.common.collect.Maps;
|
||||
import com.google.common.collect.Sets;
|
||||
|
||||
import eu.dnetlib.pace.AbstractPaceTest;
|
||||
import eu.dnetlib.pace.common.AbstractPaceFunctions;
|
||||
|
|
|
@ -6,7 +6,6 @@ import static org.junit.jupiter.api.Assertions.assertEquals;
|
|||
import java.util.*;
|
||||
|
||||
import org.junit.jupiter.api.BeforeAll;
|
||||
import org.junit.jupiter.api.BeforeEach;
|
||||
import org.junit.jupiter.api.Test;
|
||||
import org.junit.jupiter.api.TestInstance;
|
||||
|
||||
|
@ -23,18 +22,14 @@ public class ComparatorTest extends AbstractPaceTest {
|
|||
|
||||
@BeforeAll
|
||||
public void setup() {
|
||||
conf = DedupConfig
|
||||
.load(readFromClasspath("/eu/dnetlib/pace/config/organization.current.conf.json", ComparatorTest.class));
|
||||
}
|
||||
|
||||
@BeforeEach
|
||||
public void beforeEachTest() {
|
||||
params = new HashMap<>();
|
||||
params.put("weight", "1.0");
|
||||
params.put("surname_th", "0.99");
|
||||
params.put("name_th", "0.95");
|
||||
params.put("jpath_value", "$.value");
|
||||
params.put("jpath_classid", "$.qualifier.classid");
|
||||
conf = DedupConfig
|
||||
.load(readFromClasspath("/eu/dnetlib/pace/config/organization.current.conf.json", ComparatorTest.class));
|
||||
}
|
||||
|
||||
@Test
|
||||
|
@ -68,10 +63,7 @@ public class ComparatorTest extends AbstractPaceTest {
|
|||
.distance(
|
||||
"Politechniki Warszawskiej (Warsaw University of Technology)", "Warsaw University of Technology",
|
||||
conf));
|
||||
|
||||
// failing because 'Allen' is a transliterated Greek stopword
|
||||
// assertEquals(-1.0, cityMatch.distance("Allen (United States)", "United States Military Academy", conf));
|
||||
assertEquals(-1.0, cityMatch.distance("Washington (United States)", "United States Military Academy", conf));
|
||||
assertEquals(-1.0, cityMatch.distance("Allen (United States)", "United States Military Academy", conf));
|
||||
}
|
||||
|
||||
@Test
|
||||
|
@ -86,7 +78,7 @@ public class ComparatorTest extends AbstractPaceTest {
|
|||
assertEquals(1.0, keywordMatch.distance("Polytechnic University of Turin", "POLITECNICO DI TORINO", conf));
|
||||
assertEquals(1.0, keywordMatch.distance("Istanbul Commerce University", "İstanbul Ticarət Universiteti", conf));
|
||||
assertEquals(1.0, keywordMatch.distance("Franklin College", "Concordia College", conf));
|
||||
assertEquals(2.0 / 3.0, keywordMatch.distance("University of Georgia", "Georgia State University", conf));
|
||||
assertEquals(0.5, keywordMatch.distance("University of Georgia", "Georgia State University", conf));
|
||||
assertEquals(0.5, keywordMatch.distance("University College London", "University of London", conf));
|
||||
assertEquals(0.5, keywordMatch.distance("Washington State University", "University of Washington", conf));
|
||||
assertEquals(-1.0, keywordMatch.distance("Allen (United States)", "United States Military Academy", conf));
|
||||
|
@ -120,7 +112,7 @@ public class ComparatorTest extends AbstractPaceTest {
|
|||
public void stringContainsMatchTest() {
|
||||
|
||||
params.put("string", "openorgs");
|
||||
params.put("aggregator", "XOR");
|
||||
params.put("bool", "XOR");
|
||||
params.put("caseSensitive", "false");
|
||||
|
||||
StringContainsMatch stringContainsMatch = new StringContainsMatch(params);
|
||||
|
@ -128,7 +120,7 @@ public class ComparatorTest extends AbstractPaceTest {
|
|||
assertEquals(0.0, stringContainsMatch.distance("openorgs", "openorgs", conf));
|
||||
|
||||
params.put("string", "openorgs");
|
||||
params.put("aggregator", "AND");
|
||||
params.put("bool", "AND");
|
||||
params.put("caseSensitive", "false");
|
||||
|
||||
stringContainsMatch = new StringContainsMatch(params);
|
||||
|
|
|
@ -6,8 +6,7 @@ import static org.junit.jupiter.api.Assertions.assertEquals;
|
|||
import java.util.HashMap;
|
||||
import java.util.Map;
|
||||
|
||||
import org.junit.jupiter.api.BeforeAll;
|
||||
import org.junit.jupiter.api.Test;
|
||||
import org.junit.jupiter.api.*;
|
||||
|
||||
import eu.dnetlib.pace.model.Person;
|
||||
import jdk.nashorn.internal.ir.annotations.Ignore;
|
||||
|
|
|
@ -1,72 +0,0 @@
|
|||
# Action Management Framework

This module implements the oozie workflow for the integration of pre-built contents into the OpenAIRE Graph.

Such contents can be

* brand new, non-existing records to be introduced as nodes of the graph
* updates (or enrichment) for records that do exist in the graph (e.g. a new subject term for a publication)
* relations among existing nodes

The actionset contents are organised into logical containers; each of them can contain multiple versions of the contents and is characterised by

* a name
* an identifier
* the paths on HDFS where each version of the contents is stored

Each version is then characterised by

* the creation date
* the last update date
* an indication of whether it is the latest version or an expired one, candidate for garbage collection

## ActionSet serialization

Each actionset version contains records compliant with the graph internal data model, i.e. subclasses of `eu.dnetlib.dhp.schema.oaf.Oaf`,
defined in the external schemas module

```
<dependency>
    <groupId>eu.dnetlib.dhp</groupId>
    <artifactId>${dhp-schemas.artifact}</artifactId>
    <version>${dhp-schemas.version}</version>
</dependency>
```

When the actionset contains a relationship, the model class to use is `eu.dnetlib.dhp.schema.oaf.Relation`; otherwise,
when the actionset contains an entity, it is a `eu.dnetlib.dhp.schema.oaf.OafEntity` or one of its subclasses
`Datasource`, `Organization`, `Project`, `Result` (or one of its subclasses `Publication`, `Dataset`, etc.).

Then, each OpenAIRE Graph model class instance must be wrapped using the class `eu.dnetlib.dhp.schema.action.AtomicAction`, a generic
container that defines two attributes

* `T payload` the OpenAIRE Graph class instance containing the data;
* `Class<T> clazz` must contain the class whose instance is contained in the payload.

Each AtomicAction can then be serialised in JSON format using `com.fasterxml.jackson.databind.ObjectMapper` from

```
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>${dhp.jackson.version}</version>
</dependency>
```

Then, the JSON serialization must be stored as a GZip compressed sequence file (`org.apache.hadoop.mapred.SequenceFileOutputFormat`).
As such, it contains a set of tuples, a key and a value defined as `org.apache.hadoop.io.Text` where

* the `key` must be set to the class canonical name contained in the `AtomicAction`;
* the `value` must be set to the AtomicAction JSON serialization.

The following snippet provides an example of how to create an actionset version of Relation records:

```
rels // JavaRDD<Relation>
    .map(relation -> new AtomicAction<Relation>(Relation.class, relation))
    .mapToPair(
        aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
            new Text(OBJECT_MAPPER.writeValueAsString(aa))))
    .saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
```
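
The key/value convention described in this README can also be sketched without Spark or Hadoop on the classpath. The class below is a hypothetical stand-in, not `eu.dnetlib.dhp.schema.action.AtomicAction` itself: it only shows that the key is the canonical name of the wrapped class, while the value is whatever JSON string the serializer produces. The serializer is passed in as a function because the actual JSON layout produced by Jackson is not reproduced here.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the AtomicAction container (clazz + payload).
final class AtomicActionSketch<T> {

    private final Class<T> clazz;
    private final T payload;

    AtomicActionSketch(Class<T> clazz, T payload) {
        this.clazz = clazz;
        this.payload = payload;
    }

    Class<T> getClazz() {
        return clazz;
    }

    T getPayload() {
        return payload;
    }

    // Key: canonical name of the wrapped class. Value: JSON produced by the given serializer.
    Map.Entry<String, String> toKeyValue(Function<AtomicActionSketch<T>, String> toJson) {
        return new AbstractMap.SimpleEntry<>(clazz.getCanonicalName(), toJson.apply(this));
    }

    public static void main(String[] args) {
        AtomicActionSketch<String> aa = new AtomicActionSketch<>(String.class, "hello");
        // Placeholder serializer; a real job would delegate to an ObjectMapper.
        Map.Entry<String, String> kv = aa.toKeyValue(a -> "\"" + a.getPayload() + "\"");
        System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
}
```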
|
||||
|
|
@ -63,6 +63,10 @@
|
|||
<groupId>eu.dnetlib</groupId>
|
||||
<artifactId>dnet-openaireplus-mapping-utils</artifactId>
|
||||
</exclusion>
|
||||
<exclusion>
|
||||
<groupId>eu.dnetlib</groupId>
|
||||
<artifactId>dnet-index-solr-common</artifactId>
|
||||
</exclusion>
|
||||
<exclusion>
|
||||
<groupId>saxonica</groupId>
|
||||
<artifactId>saxon</artifactId>
|
||||
|
|
|
@ -20,7 +20,6 @@ import org.apache.spark.sql.SparkSession;
|
|||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
import com.fasterxml.jackson.databind.DeserializationFeature;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
import com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException;
|
||||
|
||||
|
@ -34,8 +33,7 @@ import eu.dnetlib.dhp.schema.oaf.*;
|
|||
public class PromoteActionPayloadForGraphTableJob {
|
||||
private static final Logger logger = LoggerFactory.getLogger(PromoteActionPayloadForGraphTableJob.class);
|
||||
|
||||
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
|
||||
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
|
||||
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
|
||||
|
||||
public static void main(String[] args) throws Exception {
|
||||
String jsonConfiguration = IOUtils
|
||||
|
|
|
@ -31,7 +31,6 @@ import org.mockito.Mock;
|
|||
import org.mockito.Mockito;
|
||||
import org.mockito.junit.jupiter.MockitoExtension;
|
||||
|
||||
import com.fasterxml.jackson.databind.DeserializationFeature;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
|
||||
import eu.dnetlib.dhp.actionmanager.ISClient;
|
||||
|
@ -47,8 +46,7 @@ public class PartitionActionSetsByPayloadTypeJobTest {
|
|||
private static Configuration configuration;
|
||||
private static SparkSession spark;
|
||||
|
||||
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
|
||||
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
|
||||
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
|
||||
|
||||
private static final StructType ATOMIC_ACTION_SCHEMA = StructType$.MODULE$
|
||||
.apply(
|
||||
|
|
|
@ -25,7 +25,6 @@ import org.junit.jupiter.params.ParameterizedTest;
|
|||
import org.junit.jupiter.params.provider.Arguments;
|
||||
import org.junit.jupiter.params.provider.MethodSource;
|
||||
|
||||
import com.fasterxml.jackson.databind.DeserializationFeature;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
|
||||
import eu.dnetlib.dhp.schema.common.ModelSupport;
|
||||
|
@ -42,8 +41,7 @@ public class PromoteActionPayloadForGraphTableJobTest {
private Path inputActionPayloadRootDir;
private Path outputDir;

private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

@BeforeAll
public static void beforeAll() {
@ -156,10 +154,6 @@ public class PromoteActionPayloadForGraphTableJobTest {
List<? extends Oaf> actualOutputRows = readGraphTableFromJobOutput(outputGraphTableDir.toString(), rowClazz)
.collectAsList()
.stream()
.map(s -> {
s.setLastupdatetimestamp(0L);
return s;
})
.sorted(Comparator.comparingInt(Object::hashCode))
.collect(Collectors.toList());
String expectedOutputGraphTableJsonDumpPath = resultFileLocation(strategy, rowClazz, actionPayloadClazz);
@ -172,10 +166,6 @@ public class PromoteActionPayloadForGraphTableJobTest {
expectedOutputGraphTableJsonDumpFile.toString(), rowClazz)
.collectAsList()
.stream()
.map(s -> {
s.setLastupdatetimestamp(0L);
return s;
})
.sorted(Comparator.comparingInt(Object::hashCode))
.collect(Collectors.toList());
assertIterableEquals(expectedOutputRows, actualOutputRows);

@ -79,8 +79,8 @@
</dependency>
<dependency>
<groupId>org.scala-lang.modules</groupId>
<artifactId>scala-xml_${scala.binary.version}</artifactId>
<version>${scala-xml.version}</version>
<artifactId>scala-xml_2.12</artifactId>
<version>2.1.0</version>
</dependency>

<dependency>

@ -11,7 +11,6 @@ import org.apache.spark.sql.SparkSession;
import com.fasterxml.jackson.databind.ObjectMapper;

import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.dhp.schema.oaf.Subject;
@ -94,9 +93,4 @@ public class Constants {
return s;

}

public static void removeOutputDir(SparkSession spark, String path) {
HdfsSupport.remove(path, spark.sparkContext().hadoopConfiguration());
}

}

@ -1,162 +0,0 @@

package eu.dnetlib.dhp.actionmanager.bipaffiliations;

import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.Dataset;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.fasterxml.jackson.databind.ObjectMapper;

import eu.dnetlib.dhp.actionmanager.Constants;
import eu.dnetlib.dhp.actionmanager.ror.GenerateRorActionSetJob;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import scala.Tuple2;

/**
* Creates action sets for Crossref affiliation relations inferred by BIP!
*/
public class PrepareAffiliationRelations implements Serializable {

private static final Logger log = LoggerFactory.getLogger(PrepareAffiliationRelations.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final String ID_PREFIX = "50|doi_________::";
public static final String BIP_AFFILIATIONS_CLASSID = "result:organization:bipinference";
public static final String BIP_AFFILIATIONS_CLASSNAME = "Affiliation relation inferred by BIP!";
public static final String BIP_INFERENCE_PROVENANCE = "bip:affiliation:crossref";

public static <I extends Result> void main(String[] args) throws Exception {

String jsonConfiguration = IOUtils
.toString(
PrepareAffiliationRelations.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/bipaffiliations/input_actionset_parameter.json"));

final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);
parser.parseArgument(args);

Boolean isSparkSessionManaged = Constants.isSparkSessionManaged(parser);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);

final String inputPath = parser.get("inputPath");
log.info("inputPath {}: ", inputPath);

final String outputPath = parser.get("outputPath");
log.info("outputPath {}: ", outputPath);

SparkConf conf = new SparkConf();

runWithSparkSession(
conf,
isSparkSessionManaged,
spark -> {
Constants.removeOutputDir(spark, outputPath);
prepareAffiliationRelations(spark, inputPath, outputPath);
});
}

private static <I extends Result> void prepareAffiliationRelations(SparkSession spark, String inputPath,
String outputPath) {

// load and parse affiliation relations from HDFS
Dataset<Row> df = spark
.read()
.schema("`DOI` STRING, `Matchings` ARRAY<STRUCT<`RORid`:STRING,`Confidence`:DOUBLE>>")
.json(inputPath);

// unroll nested arrays
df = df
.withColumn("matching", functions.explode(new Column("Matchings")))
.select(
new Column("DOI").as("doi"),
new Column("matching.RORid").as("rorid"),
new Column("matching.Confidence").as("confidence"));

// prepare action sets for affiliation relations
df
.toJavaRDD()
.flatMap((FlatMapFunction<Row, Relation>) row -> {

// DOI to OpenAIRE id
final String paperId = ID_PREFIX
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", row.getAs("doi")));

// ROR id to OpenAIRE id
final String affId = GenerateRorActionSetJob.calculateOpenaireId(row.getAs("rorid"));

Qualifier qualifier = OafMapperUtils
.qualifier(
BIP_AFFILIATIONS_CLASSID,
BIP_AFFILIATIONS_CLASSNAME,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS);

// format data info; setting `confidence` into relation's `trust`
DataInfo dataInfo = OafMapperUtils
.dataInfo(
false,
BIP_INFERENCE_PROVENANCE,
true,
false,
qualifier,
Double.toString(row.getAs("confidence")));

List<KeyValue> collectedfrom = OafMapperUtils.listKeyValues(ModelConstants.CROSSREF_ID, "Crossref");

// return bi-directional relations
return getAffiliationRelationPair(paperId, affId, collectedfrom, dataInfo).iterator();

})
.map(p -> new AtomicAction(Relation.class, p))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);

}

private static List<Relation> getAffiliationRelationPair(String paperId, String affId, List<KeyValue> collectedfrom,
DataInfo dataInfo) {
return Arrays
.asList(
OafMapperUtils
.getRelation(
paperId,
affId,
ModelConstants.RESULT_ORGANIZATION,
ModelConstants.AFFILIATION,
ModelConstants.HAS_AUTHOR_INSTITUTION,
collectedfrom,
dataInfo,
null),
OafMapperUtils
.getRelation(
affId,
paperId,
ModelConstants.RESULT_ORGANIZATION,
ModelConstants.AFFILIATION,
ModelConstants.IS_AUTHOR_INSTITUTION_OF,
collectedfrom,
dataInfo,
null));
}
}

@ -6,14 +6,13 @@ import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;

import java.io.Serializable;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
@ -25,9 +24,8 @@ import org.slf4j.LoggerFactory;

import com.fasterxml.jackson.databind.ObjectMapper;

import eu.dnetlib.dhp.actionmanager.bipmodel.BipDeserialize;
import eu.dnetlib.dhp.actionmanager.bipmodel.BipScore;
import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipProjectModel;
import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipResultModel;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.action.AtomicAction;
@ -42,6 +40,7 @@ import scala.Tuple2;
*/
public class SparkAtomicActionScoreJob implements Serializable {

private static final String DOI = "doi";
private static final Logger log = LoggerFactory.getLogger(SparkAtomicActionScoreJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

@ -57,17 +56,18 @@ public class SparkAtomicActionScoreJob implements Serializable {

parser.parseArgument(args);

Boolean isSparkSessionManaged = isSparkSessionManaged(parser);
Boolean isSparkSessionManaged = Optional
.ofNullable(parser.get("isSparkSessionManaged"))
.map(Boolean::valueOf)
.orElse(Boolean.TRUE);

log.info("isSparkSessionManaged: {}", isSparkSessionManaged);

final String resultsInputPath = parser.get("resultsInputPath");
log.info("resultsInputPath: {}", resultsInputPath);

final String projectsInputPath = parser.get("projectsInputPath");
log.info("projectsInputPath: {}", projectsInputPath);
final String inputPath = parser.get("inputPath");
log.info("inputPath {}: ", inputPath);

final String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath);
log.info("outputPath {}: ", outputPath);

SparkConf conf = new SparkConf();

@ -76,45 +76,17 @@ public class SparkAtomicActionScoreJob implements Serializable {
isSparkSessionManaged,
spark -> {
removeOutputDir(spark, outputPath);

JavaPairRDD<Text, Text> resultsRDD = prepareResults(spark, resultsInputPath, outputPath);
JavaPairRDD<Text, Text> projectsRDD = prepareProjects(spark, projectsInputPath, outputPath);

resultsRDD
.union(projectsRDD)
.saveAsHadoopFile(
outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
prepareResults(spark, inputPath, outputPath);
});
}

private static <I extends Project> JavaPairRDD<Text, Text> prepareProjects(SparkSession spark, String inputPath,
String outputPath) {

// read input bip project scores
Dataset<BipProjectModel> projectScores = readPath(spark, inputPath, BipProjectModel.class);

return projectScores.map((MapFunction<BipProjectModel, Project>) bipProjectScores -> {
Project project = new Project();
project.setId(bipProjectScores.getProjectId());
project.setMeasures(bipProjectScores.toMeasures());
return project;
}, Encoders.bean(Project.class))
.toJavaRDD()
.map(p -> new AtomicAction(Project.class, p))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))));

}

private static <I extends Result> JavaPairRDD<Text, Text> prepareResults(SparkSession spark, String bipScorePath,
String outputPath) {
private static <I extends Result> void prepareResults(SparkSession spark, String bipScorePath, String outputPath) {

final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

JavaRDD<BipResultModel> bipDeserializeJavaRDD = sc
JavaRDD<BipDeserialize> bipDeserializeJavaRDD = sc
.textFile(bipScorePath)
.map(item -> OBJECT_MAPPER.readValue(item, BipResultModel.class));
.map(item -> OBJECT_MAPPER.readValue(item, BipDeserialize.class));

Dataset<BipScore> bipScores = spark
.createDataset(bipDeserializeJavaRDD.flatMap(entry -> entry.keySet().stream().map(key -> {
@ -124,20 +96,24 @@ public class SparkAtomicActionScoreJob implements Serializable {
return bs;
}).collect(Collectors.toList()).iterator()).rdd(), Encoders.bean(BipScore.class));

return bipScores.map((MapFunction<BipScore, Result>) bs -> {
Result ret = new Result();
bipScores

ret.setId(bs.getId());
.map((MapFunction<BipScore, Result>) bs -> {
Result ret = new Result();

ret.setMeasures(getMeasure(bs));
ret.setId(bs.getId());

return ret;
}, Encoders.bean(Result.class))
ret.setMeasures(getMeasure(bs));

return ret;
}, Encoders.bean(Result.class))
.toJavaRDD()
.map(p -> new AtomicAction(Result.class, p))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))));
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);

}

private static List<Measure> getMeasure(BipScore value) {
@ -183,4 +159,12 @@ public class SparkAtomicActionScoreJob implements Serializable {
HdfsSupport.remove(path, spark.sparkContext().hadoopConfiguration());
}

public static <R> Dataset<R> readPath(
SparkSession spark, String inputPath, Class<R> clazz) {
return spark
.read()
.textFile(inputPath)
.map((MapFunction<String, R>) value -> OBJECT_MAPPER.readValue(value, clazz), Encoders.bean(clazz));
}

}

@ -1,21 +1,19 @@

package eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers;
package eu.dnetlib.dhp.actionmanager.bipmodel;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import eu.dnetlib.dhp.actionmanager.bipmodel.Score;

/**
* Class that maps the model of the bipFinder! input data.
* Only needed for deserialization purposes
*/

public class BipResultModel extends HashMap<String, List<Score>> implements Serializable {
public class BipDeserialize extends HashMap<String, List<Score>> implements Serializable {

public BipResultModel() {
public BipDeserialize() {
super();
}

@ -1,74 +0,0 @@

package eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers;

import static eu.dnetlib.dhp.actionmanager.Constants.*;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import com.opencsv.bean.CsvBindByPosition;

import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Measure;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import lombok.AllArgsConstructor;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

@NoArgsConstructor
@AllArgsConstructor
@Getter
@Setter
public class BipProjectModel {
String projectId;

String numOfInfluentialResults;

String numOfPopularResults;

String totalImpulse;

String totalCitationCount;

// each project bip measure has exactly one value, hence one key-value pair
private Measure createMeasure(String measureId, String measureValue) {

KeyValue kv = new KeyValue();
kv.setKey("score");
kv.setValue(measureValue);
kv
.setDataInfo(
OafMapperUtils
.dataInfo(
false,
UPDATE_DATA_INFO_TYPE,
true,
false,
OafMapperUtils
.qualifier(
UPDATE_MEASURE_BIP_CLASS_ID,
UPDATE_CLASS_NAME,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS),
""));

Measure measure = new Measure();
measure.setId(measureId);
measure.setUnit(Collections.singletonList(kv));
return measure;
}

public List<Measure> toMeasures() {
return Arrays
.asList(
createMeasure("numOfInfluentialResults", numOfInfluentialResults),
createMeasure("numOfPopularResults", numOfPopularResults),
createMeasure("totalImpulse", totalImpulse),
createMeasure("totalCitationCount", totalCitationCount));
}

}

@ -24,8 +24,8 @@ import org.slf4j.LoggerFactory;

import com.fasterxml.jackson.databind.ObjectMapper;

import eu.dnetlib.dhp.actionmanager.bipmodel.BipDeserialize;
import eu.dnetlib.dhp.actionmanager.bipmodel.BipScore;
import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipResultModel;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.common.ModelConstants;
@ -82,9 +82,9 @@ public class PrepareBipFinder implements Serializable {

final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

JavaRDD<BipResultModel> bipDeserializeJavaRDD = sc
JavaRDD<BipDeserialize> bipDeserializeJavaRDD = sc
.textFile(inputPath)
.map(item -> OBJECT_MAPPER.readValue(item, BipResultModel.class));
.map(item -> OBJECT_MAPPER.readValue(item, BipDeserialize.class));

spark
.createDataset(bipDeserializeJavaRDD.flatMap(entry -> entry.keySet().stream().map(key -> {

@ -6,6 +6,7 @@ import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.IOUtils;
@ -22,6 +23,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;

import eu.dnetlib.dhp.actionmanager.project.PrepareProjects;
import eu.dnetlib.dhp.actionmanager.project.utils.model.JsonTopic;
import eu.dnetlib.dhp.actionmanager.project.utils.model.Project;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;

/**

@ -168,7 +168,7 @@ public class GenerateRorActionSetJob {

}

public static String calculateOpenaireId(final String rorId) {
private static String calculateOpenaireId(final String rorId) {
return String.format("20|%s::%s", Constants.ROR_NS_PREFIX, DHPUtils.md5(rorId));
}


@ -75,7 +75,7 @@ public class SparkAtomicActionUsageJob implements Serializable {
removeOutputDir(spark, outputPath);
prepareData(dbname, spark, workingPath + "/usageDb", "usage_stats", "result_id");
prepareData(dbname, spark, workingPath + "/projectDb", "project_stats", "id");
prepareData(dbname, spark, workingPath + "/datasourceDb", "datasource_stats", "repository_id");
prepareData(dbname, spark, workingPath + "/datasourceDb", "datasource_stats", "repositor_id");
writeActionSet(spark, workingPath, outputPath);
});
}

@ -1,20 +0,0 @@
[
{
"paramName": "issm",
"paramLongName": "isSparkSessionManaged",
"paramDescription": "when true will stop SparkSession after job execution",
"paramRequired": false
},
{
"paramName": "ip",
"paramLongName": "inputPath",
"paramDescription": "the URL from where to get the programme file",
"paramRequired": true
},
{
"paramName": "o",
"paramLongName": "outputPath",
"paramDescription": "the path of the new ActionSet",
"paramRequired": true
}
]

@ -1,35 +0,0 @@
# --- You can override the following properties (if needed) coming from your ~/.dhp/application.properties ---
# dhp.hadoop.frontend.temp.dir=/home/ilias.kanellos
# dhp.hadoop.frontend.user.name=ilias.kanellos
# dhp.hadoop.frontend.host.name=iis-cdh5-test-gw.ocean.icm.edu.pl
# dhp.hadoop.frontend.port.ssh=22
# oozieServiceLoc=http://iis-cdh5-test-m3:11000/oozie
# jobTracker=yarnRM
# nameNode=hdfs://nameservice1
# oozie.execution.log.file.location = target/extract-and-run-on-remote-host.log
# maven.executable=mvn

# Some memory and driver settings for more demanding tasks
sparkDriverMemory=10G
sparkExecutorMemory=10G
sparkExecutorCores=4
sparkShufflePartitions=7680

# The above is given differently in an example I found online
oozie.action.sharelib.for.spark=spark2
oozieActionShareLibForSpark2=spark2
spark2YarnHistoryServerAddress=http://iis-cdh5-test-gw.ocean.icm.edu.pl:18089
spark2EventLogDir=/user/spark/spark2ApplicationHistory
sparkSqlWarehouseDir=/user/hive/warehouse
hiveMetastoreUris=thrift://iis-cdh5-test-m3.ocean.icm.edu.pl:9083
# This MAY avoid the no library used error
oozie.use.system.libpath=true
# Some stuff copied from openaire's jobs
spark2ExtraListeners=com.cloudera.spark.lineage.NavigatorAppListener
spark2SqlQueryExecutionListeners=com.cloudera.spark.lineage.NavigatorQueryListener

# The following is needed as a property of a workflow
oozie.wf.application.path=${oozieTopWfApplicationPath}

inputPath=/data/bip-affiliations/data.json
outputPath=/tmp/crossref-affiliations-output-v5

@ -1,30 +0,0 @@
<configuration>
<property>
<name>jobTracker</name>
<value>yarnRM</value>
</property>
<property>
<name>nameNode</name>
<value>hdfs://nameservice1</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
<property>
<name>hiveMetastoreUris</name>
<value>thrift://iis-cdh5-test-m3.ocean.icm.edu.pl:9083</value>
</property>
<property>
<name>hiveJdbcUrl</name>
<value>jdbc:hive2://iis-cdh5-test-m3.ocean.icm.edu.pl:10000</value>
</property>
<property>
<name>hiveDbName</name>
<value>openaire</value>
</property>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>

@ -1,107 +0,0 @@
<workflow-app name="BipAffiliations" xmlns="uri:oozie:workflow:0.5">
<parameters>

<property>
<name>inputPath</name>
<description>the path where to find the inferred affiliation relations</description>
</property>
<property>
<name>outputPath</name>
<description>the path where to store the actionset</description>
</property>
<property>
<name>sparkDriverMemory</name>
<description>memory for driver process</description>
</property>
<property>
<name>sparkExecutorMemory</name>
<description>memory for individual executor</description>
</property>
<property>
<name>sparkExecutorCores</name>
<description>number of cores used by single executor</description>
</property>
<property>
<name>oozieActionShareLibForSpark2</name>
<description>oozie action sharelib for spark 2.*</description>
</property>
<property>
<name>spark2ExtraListeners</name>
<value>com.cloudera.spark.lineage.NavigatorAppListener</value>
<description>spark 2.* extra listeners classname</description>
</property>
<property>
<name>spark2SqlQueryExecutionListeners</name>
<value>com.cloudera.spark.lineage.NavigatorQueryListener</value>
<description>spark 2.* sql query execution listeners classname</description>
</property>
<property>
<name>spark2YarnHistoryServerAddress</name>
<description>spark 2.* yarn history server address</description>
</property>
<property>
<name>spark2EventLogDir</name>
<description>spark 2.* event log dir location</description>
</property>
</parameters>

<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapreduce.job.queuename</name>
<value>${queueName}</value>
</property>
<property>
<name>oozie.launcher.mapred.job.queue.name</name>
<value>${oozieLauncherQueueName}</value>
</property>
<property>
<name>oozie.action.sharelib.for.spark</name>
<value>${oozieActionShareLibForSpark2}</value>
</property>

</configuration>
</global>
<start to="deleteoutputpath"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="deleteoutputpath">
<fs>
<delete path="${outputPath}"/>
<mkdir path="${outputPath}"/>
<delete path="${workingDir}"/>
<mkdir path="${workingDir}"/>
</fs>
<ok to="atomicactions"/>
<error to="Kill"/>
</action>

<action name="atomicactions">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Produces the atomic action with the inferred by BIP! affiliation relations from Crossref</name>
<class>eu.dnetlib.dhp.actionmanager.bipaffiliations.PrepareAffiliationRelations</class>
<jar>dhp-aggregation-${projectVersion}.jar</jar>
<spark-opts>
--executor-memory=${sparkExecutorMemory}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.warehouse.dir=${sparkSqlWarehouseDir}
</spark-opts>
<arg>--inputPath</arg><arg>${inputPath}</arg>
<arg>--outputPath</arg><arg>${outputPath}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>

<end name="End"/>
</workflow-app>

@ -6,15 +6,9 @@
"paramRequired": false
},
{
"paramName": "rip",
"paramLongName": "resultsInputPath",
"paramDescription": "the URL from where to get the input file for results",
"paramRequired": true
},
{
"paramName": "pip",
"paramLongName": "projectsInputPath",
"paramDescription": "the URL from where to get the input file for projects",
"paramName": "ip",
"paramLongName": "inputPath",
"paramDescription": "the URL from where to get the programme file",
"paramRequired": true
},
{

@ -1,9 +1,4 @@
{
"ETHZ.UNIGENF": {
"openaire_id": "opendoar____::1400",
"datacite_name": "Uni Genf",
"official_name": "Archive ouverte UNIGE"
},
"GESIS.RKI": {
"openaire_id": "re3data_____::r3d100010436",
"datacite_name": "Forschungsdatenzentrum am Robert Koch Institut",

@ -222,7 +222,7 @@ object BioDBToOAF {
def uniprotToOAF(input: String): List[Oaf] = {
implicit lazy val formats: DefaultFormats.type = org.json4s.DefaultFormats
lazy val json = parse(input)
val pid = (json \ "pid").extract[String].trim()
val pid = (json \ "pid").extract[String]

val d = new Dataset


@ -18,9 +18,9 @@ import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql._
import org.slf4j.{Logger, LoggerFactory}

import java.io.InputStream
import java.io.{ByteArrayInputStream, InputStream}
import scala.io.Source
import scala.xml.pull.XMLEventReader
//import scala.xml.pull.XMLEventReader

object SparkCreateBaselineDataFrame {

@ -197,8 +197,8 @@ object SparkCreateBaselineDataFrame {
val ds: Dataset[PMArticle] = spark.createDataset(
k.filter(i => i._1.endsWith(".gz"))
.flatMap(i => {
val xml = new XMLEventReader(Source.fromBytes(i._2.getBytes()))
new PMParser(xml)
// val xml = new XMLEventReader(Source.fromBytes(i._2.getBytes()))
new PMParser(new ByteArrayInputStream(i._2.getBytes()))
})
)
ds.map(p => (p.getPmid, p))(Encoders.tuple(Encoders.STRING, PMEncoder))

@ -1,11 +1,20 @@
package eu.dnetlib.dhp.sx.bio.pubmed

import javax.xml.stream.{XMLEventReader, XMLInputFactory, XMLStreamConstants}
import scala.language.postfixOps
import scala.xml.MetaData
import scala.xml.pull.{EvElemEnd, EvElemStart, EvText, XMLEventReader}
//import scala.xml.pull.{EvElemEnd, EvElemStart, EvText, XMLEventReader}

/** @param xml
*/
class PMParser(xml: XMLEventReader) extends Iterator[PMArticle] {
class PMParser(stream: java.io.InputStream) extends Iterator[PMArticle] {

private val reader: XMLEventReader = {
println("INSTANTIATE READER")
val factory = XMLInputFactory.newInstance()
factory.createXMLEventReader(stream)

}

var currentArticle: PMArticle = generateNextArticle()

@ -49,85 +58,142 @@ class PMParser(xml: XMLEventReader) extends Iterator[PMArticle] {
var currentMonth = "01"
var currentDay = "01"
var currentArticleType: String = null
var sb = new StringBuilder()
var insideChar = false
var complete = false
while (reader.hasNext && !complete) {

while (xml.hasNext) {
xml.next match {
case EvElemStart(_, label, attrs, _) =>
currNode = label
val next = reader.nextEvent()

label match {
case "PubmedArticle" => currentArticle = new PMArticle
case "Author" => currentAuthor = new PMAuthor
case "Journal" => currentJournal = new PMJournal
case "Grant" => currentGrant = new PMGrant
case "PublicationType" | "DescriptorName" =>
currentSubject = new PMSubject
currentSubject.setMeshId(extractAttributes(attrs, "UI"))
case "ArticleId" => currentArticleType = extractAttributes(attrs, "IdType")
case _ =>
}
case EvElemEnd(_, label) =>
label match {
case "PubmedArticle" => return currentArticle
case "Author" => currentArticle.getAuthors.add(currentAuthor)
case "Journal" => currentArticle.setJournal(currentJournal)
case "Grant" => currentArticle.getGrants.add(currentGrant)
case "PubMedPubDate" =>
if (currentArticle.getDate == null)
currentArticle.setDate(validate_Date(currentYear, currentMonth, currentDay))
case "PubDate" => currentJournal.setDate(s"$currentYear-$currentMonth-$currentDay")
case "DescriptorName" => currentArticle.getSubjects.add(currentSubject)
case "PublicationType" => currentArticle.getPublicationTypes.add(currentSubject)
case _ =>
}
case EvText(text) =>
if (currNode != null && text.trim.nonEmpty)
currNode match {
case "ArticleTitle" => {
if (currentArticle.getTitle == null)
currentArticle.setTitle(text.trim)
else
currentArticle.setTitle(currentArticle.getTitle + text.trim)
}
case "AbstractText" => {
if (currentArticle.getDescription == null)
currentArticle.setDescription(text.trim)
else
currentArticle.setDescription(currentArticle.getDescription + text.trim)
}
case "PMID" => currentArticle.setPmid(text.trim)
case "ArticleId" =>
if ("doi".equalsIgnoreCase(currentArticleType)) currentArticle.setDoi(text.trim)
if ("pmc".equalsIgnoreCase(currentArticleType)) currentArticle.setPmcId(text.trim)
case "Language" => currentArticle.setLanguage(text.trim)
case "ISSN" => currentJournal.setIssn(text.trim)
case "GrantID" => currentGrant.setGrantID(text.trim)
case "Agency" => currentGrant.setAgency(text.trim)
|
||||
case "Country" => if (currentGrant != null) currentGrant.setCountry(text.trim)
|
||||
case "Year" => currentYear = text.trim
|
||||
case "Month" => currentMonth = text.trim
|
||||
case "Day" => currentDay = text.trim
|
||||
case "Volume" => currentJournal.setVolume(text.trim)
|
||||
case "Issue" => currentJournal.setIssue(text.trim)
|
||||
case "PublicationType" | "DescriptorName" => currentSubject.setValue(text.trim)
|
||||
case "LastName" => {
|
||||
if (currentAuthor != null)
|
||||
currentAuthor.setLastName(text.trim)
|
||||
}
|
||||
case "ForeName" =>
|
||||
if (currentAuthor != null)
|
||||
currentAuthor.setForeName(text.trim)
|
||||
case "Title" =>
|
||||
if (currentJournal.getTitle == null)
|
||||
currentJournal.setTitle(text.trim)
|
||||
else
|
||||
currentJournal.setTitle(currentJournal.getTitle + text.trim)
|
||||
case _ =>
|
||||
if (next.isStartElement) {
|
||||
if (insideChar) {
|
||||
if (sb.nonEmpty)
|
||||
println(s"got data ${sb.toString.trim}")
|
||||
insideChar = false
|
||||
}
|
||||
val name = next.asStartElement().getName.getLocalPart
|
||||
println(s"Start Element $name")
|
||||
next.asStartElement().getAttributes.forEachRemaining(e => print(e.toString))
|
||||
|
||||
}
|
||||
case _ =>
|
||||
} else if (next.isEndElement) {
|
||||
if (insideChar) {
|
||||
if (sb.nonEmpty)
|
||||
println(s"got data ${sb.toString.trim}")
|
||||
insideChar = false
|
||||
}
|
||||
val name = next.asEndElement().getName.getLocalPart
|
||||
println(s"End Element $name")
|
||||
if (name.equalsIgnoreCase("PubmedArticle")) {
|
||||
complete = true
|
||||
println("Exit condition")
|
||||
}
|
||||
|
||||
} else if (next.isCharacters) {
|
||||
if (!insideChar) {
|
||||
insideChar = true
|
||||
sb.clear()
|
||||
}
|
||||
val d = next.asCharacters().getData
|
||||
if (d.trim.nonEmpty)
|
||||
sb.append(d.trim)
|
||||
}
|
||||
|
||||
// next match {
|
||||
// case _ if (next.isStartElement) =>
|
||||
// val name = next.asStartElement().getName.getLocalPart
|
||||
// println(s"Start Element $name")
|
||||
// case _ if (next.isEndElement) =>
|
||||
// val name = next.asEndElement().getName.getLocalPart
|
||||
// println(s"End Element $name")
|
||||
// case _ if (next.isCharacters) =>
|
||||
// val c = next.asCharacters()
|
||||
// val data = c.getData
|
||||
// println(s"Text value $data")
|
||||
//
|
||||
// }
|
||||
|
||||
//
|
||||
//
|
||||
// reader.next match {
|
||||
//
|
||||
// case
|
||||
//
|
||||
// case EvElemStart(_, label, attrs, _) =>
|
||||
// currNode = label
|
||||
//
|
||||
// label match {
|
||||
// case "PubmedArticle" => currentArticle = new PMArticle
|
||||
// case "Author" => currentAuthor = new PMAuthor
|
||||
// case "Journal" => currentJournal = new PMJournal
|
||||
// case "Grant" => currentGrant = new PMGrant
|
||||
// case "PublicationType" | "DescriptorName" =>
|
||||
// currentSubject = new PMSubject
|
||||
// currentSubject.setMeshId(extractAttributes(attrs, "UI"))
|
||||
// case "ArticleId" => currentArticleType = extractAttributes(attrs, "IdType")
|
||||
// case _ =>
|
||||
// }
|
||||
// case EvElemEnd(_, label) =>
|
||||
// label match {
|
||||
// case "PubmedArticle" => return currentArticle
|
||||
// case "Author" => currentArticle.getAuthors.add(currentAuthor)
|
||||
// case "Journal" => currentArticle.setJournal(currentJournal)
|
||||
// case "Grant" => currentArticle.getGrants.add(currentGrant)
|
||||
// case "PubMedPubDate" =>
|
||||
// if (currentArticle.getDate == null)
|
||||
// currentArticle.setDate(validate_Date(currentYear, currentMonth, currentDay))
|
||||
// case "PubDate" => currentJournal.setDate(s"$currentYear-$currentMonth-$currentDay")
|
||||
// case "DescriptorName" => currentArticle.getSubjects.add(currentSubject)
|
||||
// case "PublicationType" => currentArticle.getPublicationTypes.add(currentSubject)
|
||||
// case _ =>
|
||||
// }
|
||||
// case EvText(text) =>
|
||||
// if (currNode != null && text.trim.nonEmpty)
|
||||
// currNode match {
|
||||
// case "ArticleTitle" => {
|
||||
// if (currentArticle.getTitle == null)
|
||||
// currentArticle.setTitle(text.trim)
|
||||
// else
|
||||
// currentArticle.setTitle(currentArticle.getTitle + text.trim)
|
||||
// }
|
||||
// case "AbstractText" => {
|
||||
// if (currentArticle.getDescription == null)
|
||||
// currentArticle.setDescription(text.trim)
|
||||
// else
|
||||
// currentArticle.setDescription(currentArticle.getDescription + text.trim)
|
||||
// }
|
||||
// case "PMID" => currentArticle.setPmid(text.trim)
|
||||
// case "ArticleId" =>
|
||||
// if ("doi".equalsIgnoreCase(currentArticleType)) currentArticle.setDoi(text.trim)
|
||||
// if ("pmc".equalsIgnoreCase(currentArticleType)) currentArticle.setPmcId(text.trim)
|
||||
// case "Language" => currentArticle.setLanguage(text.trim)
|
||||
// case "ISSN" => currentJournal.setIssn(text.trim)
|
||||
// case "GrantID" => currentGrant.setGrantID(text.trim)
|
||||
// case "Agency" => currentGrant.setAgency(text.trim)
|
||||
// case "Country" => if (currentGrant != null) currentGrant.setCountry(text.trim)
|
||||
// case "Year" => currentYear = text.trim
|
||||
// case "Month" => currentMonth = text.trim
|
||||
// case "Day" => currentDay = text.trim
|
||||
// case "Volume" => currentJournal.setVolume(text.trim)
|
||||
// case "Issue" => currentJournal.setIssue(text.trim)
|
||||
// case "PublicationType" | "DescriptorName" => currentSubject.setValue(text.trim)
|
||||
// case "LastName" => {
|
||||
// if (currentAuthor != null)
|
||||
// currentAuthor.setLastName(text.trim)
|
||||
// }
|
||||
// case "ForeName" =>
|
||||
// if (currentAuthor != null)
|
||||
// currentAuthor.setForeName(text.trim)
|
||||
// case "Title" =>
|
||||
// if (currentJournal.getTitle == null)
|
||||
// currentJournal.setTitle(text.trim)
|
||||
// else
|
||||
// currentJournal.setTitle(currentJournal.getTitle + text.trim)
|
||||
// case _ =>
|
||||
//
|
||||
// }
|
||||
// case _ =>
|
||||
// }
|
||||
|
||||
}
|
||||
null
|
||||
}
|
||||
|
|
|
@ -1,145 +0,0 @@
|
|||
|
||||
package eu.dnetlib.dhp.actionmanager.bipaffiliations;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.*;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.nio.file.Files;
|
||||
import java.nio.file.Path;
|
||||
|
||||
import org.apache.commons.io.FileUtils;
|
||||
import org.apache.hadoop.io.Text;
|
||||
import org.apache.spark.SparkConf;
|
||||
import org.apache.spark.api.java.JavaRDD;
|
||||
import org.apache.spark.api.java.JavaSparkContext;
|
||||
import org.apache.spark.sql.Dataset;
|
||||
import org.apache.spark.sql.Encoders;
|
||||
import org.apache.spark.sql.Row;
|
||||
import org.apache.spark.sql.SparkSession;
|
||||
import org.junit.jupiter.api.AfterAll;
|
||||
import org.junit.jupiter.api.Assertions;
|
||||
import org.junit.jupiter.api.BeforeAll;
|
||||
import org.junit.jupiter.api.Test;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
|
||||
import eu.dnetlib.dhp.schema.action.AtomicAction;
|
||||
import eu.dnetlib.dhp.schema.common.ModelConstants;
|
||||
import eu.dnetlib.dhp.schema.oaf.Relation;
|
||||
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
|
||||
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
|
||||
|
||||
public class PrepareAffiliationRelationsTest {
|
||||
|
||||
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
|
||||
|
||||
private static SparkSession spark;
|
||||
|
||||
private static Path workingDir;
|
||||
private static final String ID_PREFIX = "50|doi_________::";
|
||||
private static final Logger log = LoggerFactory
|
||||
.getLogger(PrepareAffiliationRelationsTest.class);
|
||||
|
||||
@BeforeAll
|
||||
public static void beforeAll() throws IOException {
|
||||
workingDir = Files.createTempDirectory(PrepareAffiliationRelationsTest.class.getSimpleName());
|
||||
|
||||
log.info("Using work dir {}", workingDir);
|
||||
|
||||
SparkConf conf = new SparkConf();
|
||||
conf.setAppName(PrepareAffiliationRelationsTest.class.getSimpleName());
|
||||
|
||||
conf.setMaster("local[*]");
|
||||
conf.set("spark.driver.host", "localhost");
|
||||
conf.set("hive.metastore.local", "true");
|
||||
conf.set("spark.ui.enabled", "false");
|
||||
conf.set("spark.sql.warehouse.dir", workingDir.toString());
|
||||
conf.set("hive.metastore.warehouse.dir", workingDir.resolve("warehouse").toString());
|
||||
|
||||
spark = SparkSession
|
||||
.builder()
|
||||
.appName(PrepareAffiliationRelationsTest.class.getSimpleName())
|
||||
.config(conf)
|
||||
.getOrCreate();
|
||||
}
|
||||
|
||||
@AfterAll
|
||||
public static void afterAll() throws IOException {
|
||||
FileUtils.deleteDirectory(workingDir.toFile());
|
||||
spark.stop();
|
||||
}
|
||||
|
||||
@Test
|
||||
void testMatch() throws Exception {
|
||||
|
||||
String affiliationRelationsPath = getClass()
|
||||
.getResource("/eu/dnetlib/dhp/actionmanager/bipaffiliations/doi_to_ror.json")
|
||||
.getPath();
|
||||
|
||||
String outputPath = workingDir.toString() + "/actionSet";
|
||||
|
||||
PrepareAffiliationRelations
|
||||
.main(
|
||||
new String[] {
|
||||
"-isSparkSessionManaged", Boolean.FALSE.toString(),
|
||||
"-inputPath", affiliationRelationsPath,
|
||||
"-outputPath", outputPath
|
||||
});
|
||||
|
||||
final JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
|
||||
|
||||
JavaRDD<Relation> tmp = sc
|
||||
.sequenceFile(outputPath, Text.class, Text.class)
|
||||
.map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
|
||||
.map(aa -> ((Relation) aa.getPayload()));
|
||||
|
||||
// for (Relation r : tmp.collect()) {
|
||||
// System.out.println(
|
||||
// r.getSource() + "\t" + r.getTarget() + "\t" + r.getRelType() + "\t" + r.getRelClass() + "\t" + r.getSubRelType() + "\t" + r.getValidationDate() + "\t" + r.getDataInfo().getTrust() + "\t" + r.getDataInfo().getInferred()
|
||||
// );
|
||||
// }
|
||||
// count the number of relations
|
||||
assertEquals(20, tmp.count());
|
||||
|
||||
Dataset<Relation> dataset = spark.createDataset(tmp.rdd(), Encoders.bean(Relation.class));
|
||||
dataset.createOrReplaceTempView("result");
|
||||
|
||||
Dataset<Row> execVerification = spark
|
||||
.sql("select r.relType, r.relClass, r.source, r.target, r.dataInfo.trust from result r");
|
||||
|
||||
// verify that we have equal number of bi-directional relations
|
||||
Assertions
|
||||
.assertEquals(
|
||||
10, execVerification
|
||||
.filter(
|
||||
"relClass='" + ModelConstants.HAS_AUTHOR_INSTITUTION + "'")
|
||||
.collectAsList()
|
||||
.size());
|
||||
|
||||
Assertions
|
||||
.assertEquals(
|
||||
10, execVerification
|
||||
.filter(
|
||||
"relClass='" + ModelConstants.IS_AUTHOR_INSTITUTION_OF + "'")
|
||||
.collectAsList()
|
||||
.size());
|
||||
|
||||
// check confidence value of a specific relation
|
||||
String sourceDOI = "10.1061/(asce)0733-9399(2002)128:7(759)";
|
||||
|
||||
final String sourceOpenaireId = ID_PREFIX
|
||||
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", sourceDOI));
|
||||
|
||||
Assertions
|
||||
.assertEquals(
|
||||
"0.7071067812", execVerification
|
||||
.filter(
|
||||
"source='" + sourceOpenaireId + "'")
|
||||
.collectAsList()
|
||||
.get(0)
|
||||
.getString(4));
|
||||
|
||||
}
|
||||
}
|
|
@ -6,8 +6,7 @@ import static org.junit.jupiter.api.Assertions.*;
|
|||
import java.io.IOException;
|
||||
import java.nio.file.Files;
|
||||
import java.nio.file.Path;
|
||||
|
||||
import javax.xml.crypto.Data;
|
||||
import java.util.List;
|
||||
|
||||
import org.apache.commons.io.FileUtils;
|
||||
import org.apache.hadoop.io.Text;
|
||||
|
@ -28,9 +27,7 @@ import org.slf4j.LoggerFactory;
|
|||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
|
||||
import eu.dnetlib.dhp.schema.action.AtomicAction;
|
||||
import eu.dnetlib.dhp.schema.oaf.KeyValue;
|
||||
import eu.dnetlib.dhp.schema.oaf.OafEntity;
|
||||
import eu.dnetlib.dhp.schema.oaf.Project;
|
||||
import eu.dnetlib.dhp.schema.oaf.Publication;
|
||||
import eu.dnetlib.dhp.schema.oaf.Result;
|
||||
|
||||
public class SparkAtomicActionScoreJobTest {
|
||||
|
@ -40,8 +37,8 @@ public class SparkAtomicActionScoreJobTest {
|
|||
private static SparkSession spark;
|
||||
|
||||
private static Path workingDir;
|
||||
|
||||
private static final Logger log = LoggerFactory.getLogger(SparkAtomicActionScoreJobTest.class);
|
||||
private static final Logger log = LoggerFactory
|
||||
.getLogger(SparkAtomicActionScoreJobTest.class);
|
||||
|
||||
@BeforeAll
|
||||
public static void beforeAll() throws IOException {
|
||||
|
@ -72,64 +69,47 @@ public class SparkAtomicActionScoreJobTest {
|
|||
spark.stop();
|
||||
}
|
||||
|
||||
private void runJob(String resultsInputPath, String projectsInputPath, String outputPath) throws Exception {
|
||||
@Test
|
||||
void testMatch() throws Exception {
|
||||
String bipScoresPath = getClass()
|
||||
.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/bip_scores_oid.json")
|
||||
.getPath();
|
||||
|
||||
SparkAtomicActionScoreJob
|
||||
.main(
|
||||
new String[] {
|
||||
"-isSparkSessionManaged", Boolean.FALSE.toString(),
|
||||
"-resultsInputPath", resultsInputPath,
|
||||
"-projectsInputPath", projectsInputPath,
|
||||
"-outputPath", outputPath,
|
||||
"-isSparkSessionManaged",
|
||||
Boolean.FALSE.toString(),
|
||||
"-inputPath",
|
||||
|
||||
bipScoresPath,
|
||||
|
||||
"-outputPath",
|
||||
workingDir.toString() + "/actionSet"
|
||||
});
|
||||
}
|
||||
|
||||
@Test
|
||||
void testScores() throws Exception {
|
||||
|
||||
String resultsInputPath = getClass()
|
||||
.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/result_bip_scores.json")
|
||||
.getPath();
|
||||
|
||||
String projectsInputPath = getClass()
|
||||
.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/project_bip_scores.json")
|
||||
.getPath();
|
||||
|
||||
String outputPath = workingDir.toString() + "/actionSet";
|
||||
|
||||
// execute the job to generate the action sets for result scores
|
||||
runJob(resultsInputPath, projectsInputPath, outputPath);
|
||||
|
||||
final JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
|
||||
|
||||
JavaRDD<OafEntity> tmp = sc
|
||||
.sequenceFile(outputPath, Text.class, Text.class)
|
||||
JavaRDD<Result> tmp = sc
|
||||
.sequenceFile(workingDir.toString() + "/actionSet", Text.class, Text.class)
|
||||
.map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
|
||||
.map(aa -> ((OafEntity) aa.getPayload()));
|
||||
.map(aa -> ((Result) aa.getPayload()));
|
||||
|
||||
assertEquals(8, tmp.count());
|
||||
assertEquals(4, tmp.count());
|
||||
|
||||
Dataset<OafEntity> verificationDataset = spark.createDataset(tmp.rdd(), Encoders.bean(OafEntity.class));
|
||||
Dataset<Result> verificationDataset = spark.createDataset(tmp.rdd(), Encoders.bean(Result.class));
|
||||
verificationDataset.createOrReplaceTempView("result");
|
||||
|
||||
Dataset<Row> testDataset = spark
|
||||
Dataset<Row> execVerification = spark
|
||||
.sql(
|
||||
"Select p.id oaid, mes.id, mUnit.value from result p " +
|
||||
"lateral view explode(measures) m as mes " +
|
||||
"lateral view explode(mes.unit) u as mUnit ");
|
||||
|
||||
// execVerification.show();
|
||||
|
||||
Assertions.assertEquals(28, testDataset.count());
|
||||
|
||||
assertResultImpactScores(testDataset);
|
||||
assertProjectImpactScores(testDataset);
|
||||
|
||||
}
|
||||
|
||||
void assertResultImpactScores(Dataset<Row> testDataset) {
|
||||
Assertions.assertEquals(12, execVerification.count());
|
||||
Assertions
|
||||
.assertEquals(
|
||||
"6.63451994567e-09", testDataset
|
||||
"6.63451994567e-09", execVerification
|
||||
.filter(
|
||||
"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
|
||||
"and id = 'influence'")
|
||||
|
@ -139,7 +119,7 @@ public class SparkAtomicActionScoreJobTest {
|
|||
.getString(0));
|
||||
Assertions
|
||||
.assertEquals(
|
||||
"0.348694533145", testDataset
|
||||
"0.348694533145", execVerification
|
||||
.filter(
|
||||
"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
|
||||
"and id = 'popularity_alt'")
|
||||
|
@ -149,7 +129,7 @@ public class SparkAtomicActionScoreJobTest {
|
|||
.getString(0));
|
||||
Assertions
|
||||
.assertEquals(
|
||||
"2.16094680115e-09", testDataset
|
||||
"2.16094680115e-09", execVerification
|
||||
.filter(
|
||||
"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
|
||||
"and id = 'popularity'")
|
||||
|
@ -157,49 +137,7 @@ public class SparkAtomicActionScoreJobTest {
|
|||
.collectAsList()
|
||||
.get(0)
|
||||
.getString(0));
|
||||
|
||||
}
|
||||
|
||||
void assertProjectImpactScores(Dataset<Row> testDataset) throws Exception {
|
||||
|
||||
Assertions
|
||||
.assertEquals(
|
||||
"0", testDataset
|
||||
.filter(
|
||||
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
|
||||
"and id = 'numOfInfluentialResults'")
|
||||
.select("value")
|
||||
.collectAsList()
|
||||
.get(0)
|
||||
.getString(0));
|
||||
Assertions
|
||||
.assertEquals(
|
||||
"1", testDataset
|
||||
.filter(
|
||||
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
|
||||
"and id = 'numOfPopularResults'")
|
||||
.select("value")
|
||||
.collectAsList()
|
||||
.get(0)
|
||||
.getString(0));
|
||||
Assertions
|
||||
.assertEquals(
|
||||
"25", testDataset
|
||||
.filter(
|
||||
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
|
||||
"and id = 'totalImpulse'")
|
||||
.select("value")
|
||||
.collectAsList()
|
||||
.get(0)
|
||||
.getString(0));
|
||||
Assertions
|
||||
.assertEquals(
|
||||
"43", testDataset
|
||||
.filter(
|
||||
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
|
||||
"and id = 'totalCitationCount'")
|
||||
.select("value")
|
||||
.collectAsList()
|
||||
.get(0)
|
||||
.getString(0));
|
||||
}
|
||||
}
|
||||
|
|
|
@ -1,7 +0,0 @@
|
|||
{"DOI":"10.1061\/(asce)0733-9399(2002)128:7(759)","Matchings":[{"RORid":"https:\/\/ror.org\/03yxnpp24","Confidence":0.7071067812},{"RORid":"https:\/\/ror.org\/01teme464","Confidence":0.89}]}
|
||||
{"DOI":"10.1105\/tpc.8.3.343","Matchings":[{"RORid":"https:\/\/ror.org\/02k40bc56","Confidence":0.7071067812}]}
|
||||
{"DOI":"10.1161\/01.cir.0000013305.01850.37","Matchings":[{"RORid":"https:\/\/ror.org\/00qjgza05","Confidence":1}]}
|
||||
{"DOI":"10.1142\/s021821650200186x","Matchings":[{"RORid":"https:\/\/ror.org\/035xkbk20","Confidence":1},{"RORid":"https:\/\/ror.org\/05apxxy63","Confidence":1}]}
|
||||
{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(575)","Matchings":[{"RORid":"https:\/\/ror.org\/04j198w64","Confidence":0.82}]}
|
||||
{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(588)","Matchings":[{"RORid":"https:\/\/ror.org\/03m8km719","Confidence":0.8660254038},{"RORid":"https:\/\/ror.org\/02aze4h65","Confidence":0.87}]}
|
||||
{"DOI":"10.1161\/hy0202.103001","Matchings":[{"RORid":"https:\/\/ror.org\/057xtrt18","Confidence":0.7071067812}]}
|
|
@ -1,4 +0,0 @@
|
|||
{"projectId":"40|nsf_________::d93e50d22374a1cf59f6a232413ea027","numOfInfluentialResults":0,"numOfPopularResults":10,"totalImpulse":181,"totalCitationCount":235}
|
||||
{"projectId":"40|nih_________::1c93debc7085e440f245fbe70b2e8b21","numOfInfluentialResults":14,"numOfPopularResults":17,"totalImpulse":1558,"totalCitationCount":4226}
|
||||
{"projectId":"40|nih_________::c02a8233e9b60f05bb418f0c9b714833","numOfInfluentialResults":0,"numOfPopularResults":1,"totalImpulse":25,"totalCitationCount":43}
|
||||
{"projectId":"40|corda_______::d91dcf3a87dd7f72248fab0b8a4ba273","numOfInfluentialResults":2,"numOfPopularResults":3,"totalImpulse":78,"totalCitationCount":178}
|
|
@ -1,44 +1,15 @@
|
|||
{"classification": "Signaling protein", "pdb": "5NM4", "deposition_date": "2017-04-05", "title": "A2a adenosine receptor room-temperature structure determined by serial Femtosecond crystallography", "Keywords": ["Oom-temperature", " serial crystallography", " signaling protein"], "authors": ["T.weinert", "R.cheng", "D.james", "D.gashi", "P.nogly", "K.jaeger", "M.hennig", "", "J.standfuss"], "pmid": "28912485", "doi": "10.1038/S41467-017-00630-4"}
|
||||
{"classification": "Oxidoreductase/oxidoreductase inhibitor", "pdb": "4KN3", "deposition_date": "2013-05-08", "title": "Structure of the y34ns91g double mutant of dehaloperoxidase from Amphitrite ornata with 2,4,6-trichlorophenol", "Keywords": ["Lobin", " oxygen storage", " peroxidase", " oxidoreductase", " oxidoreductase-", "Oxidoreductase inhibitor complex"], "authors": ["C.wang", "L.lovelace", "L.lebioda"], "pmid": "23952341", "doi": "10.1021/BI400627W"}
|
||||
{"classification": "Transport protein", "pdb": "8HKM", "deposition_date": "2022-11-27", "title": "Ion channel", "Keywords": ["On channel", " transport protein"], "authors": ["D.h.jiang", "J.t.zhang"], "pmid": "37494189", "doi": "10.1016/J.CELREP.2023.112858"}
|
||||
{"classification": "Signaling protein", "pdb": "6JT1", "deposition_date": "2019-04-08", "title": "Structure of human soluble guanylate cyclase in the heme oxidised State", "Keywords": ["Oluble guanylate cyclase", " signaling protein"], "authors": ["L.chen", "Y.kang", "R.liu", "J.-x.wu"], "pmid": "31514202", "doi": "10.1038/S41586-019-1584-6"}
|
||||
{"classification": "Immune system", "pdb": "7OW6", "deposition_date": "2021-06-16", "title": "Crystal structure of a tcr in complex with hla-a*11:01 bound to kras G12d peptide (vvvgadgvgk)", "Keywords": ["La", " kras", " tcr", " immune system"], "authors": ["V.karuppiah", "R.a.robinson"], "doi": "10.1038/S41467-022-32811-1"}
|
||||
{"classification": "Biosynthetic protein", "pdb": "5EQ8", "deposition_date": "2015-11-12", "title": "Crystal structure of medicago truncatula histidinol-phosphate Phosphatase (mthpp) in complex with l-histidinol", "Keywords": ["Istidine biosynthesis", " metabolic pathways", " dimer", " plant", "", "Biosynthetic protein"], "authors": ["M.ruszkowski", "Z.dauter"], "pmid": "26994138", "doi": "10.1074/JBC.M115.708727"}
|
||||
{"classification": "De novo protein", "pdb": "8CWA", "deposition_date": "2022-05-18", "title": "Solution nmr structure of 8-residue rosetta-designed cyclic peptide D8.21 in cdcl3 with cis/trans switching (tc conformation, 53%)", "Keywords": ["Yclic peptide", " non natural amino acids", " cis/trans", " switch peptides", "", "De novo design", "Membrane permeability", "De novo protein"], "authors": ["T.a.ramelot", "R.tejero", "G.t.montelione"], "pmid": "36041435", "doi": "10.1016/J.CELL.2022.07.019"}
|
||||
{"classification": "Hydrolase", "pdb": "3R6M", "deposition_date": "2011-03-21", "title": "Crystal structure of vibrio parahaemolyticus yeaz", "Keywords": ["Ctin/hsp70 nucleotide-binding fold", " bacterial resuscitation", " viable", "But non-culturable state", "Resuscitation promoting factor", "Ygjd", "", "Yjee", "Vibrio parahaemolyticus", "Hydrolase"], "authors": ["A.roujeinikova", "I.aydin"], "pmid": "21858042", "doi": "10.1371/JOURNAL.PONE.0023245"}
|
||||
{"classification": "Hydrolase", "pdb": "2W5J", "deposition_date": "2008-12-10", "title": "Structure of the c14-rotor ring of the proton translocating Chloroplast atp synthase", "Keywords": ["Ydrolase", " chloroplast", " atp synthase", " lipid-binding", " cf(0)", " membrane", "", "Transport", "Formylation", "Energy transduction", "Hydrogen ion transport", "", "Ion transport", "Transmembrane", "Membrane protein"], "authors": ["M.vollmar", "D.schlieper", "M.winn", "C.buechner", "G.groth"], "pmid": "19423706", "doi": "10.1074/JBC.M109.006916"}
|
||||
{"classification": "De novo protein", "pdb": "4GLU", "deposition_date": "2012-08-14", "title": "Crystal structure of the mirror image form of vegf-a", "Keywords": ["-protein", " covalent dimer", " cysteine knot protein", " growth factor", " de", "Novo protein"], "authors": ["K.mandal", "M.uppalapati", "D.ault-riche", "J.kenney", "J.lowitz", "S.sidhu", "", "S.b.h.kent"], "pmid": "22927390", "doi": "10.1073/PNAS.1210483109"}
|
||||
{"classification": "Hydrolase/hydrolase inhibitor", "pdb": "3WYL", "deposition_date": "2014-09-01", "title": "Crystal structure of the catalytic domain of pde10a complexed with 5- Methoxy-3-(1-phenyl-1h-pyrazol-5-yl)-1-(3-(trifluoromethyl)phenyl) Pyridazin-4(1h)-one", "Keywords": ["Ydrolase-hydrolase inhibitor complex"], "authors": ["H.oki", "Y.hayano"], "pmid": "25384088", "doi": "10.1021/JM5013648"}
|
||||
{"classification": "Isomerase", "pdb": "5BOR", "deposition_date": "2015-05-27", "title": "Structure of acetobacter aceti pure-s57c, sulfonate form", "Keywords": ["Cidophile", " pure", " purine biosynthesis", " isomerase"], "authors": ["K.l.sullivan", "T.j.kappock"]}
|
||||
{"classification": "Hydrolase", "pdb": "1X0C", "deposition_date": "2005-03-17", "title": "Improved crystal structure of isopullulanase from aspergillus niger Atcc 9642", "Keywords": ["Ullulan", " glycoside hydrolase family 49", " glycoprotein", " hydrolase"], "authors": ["M.mizuno", "T.tonozuka", "A.yamamura", "Y.miyasaka", "H.akeboshi", "S.kamitori", "", "A.nishikawa", "Y.sakano"], "pmid": "18155243", "doi": "10.1016/J.JMB.2007.11.098"}
|
||||
{"classification": "Oxidoreductase", "pdb": "7CUP", "deposition_date": "2020-08-23", "title": "Structure of 2,5-dihydroxypridine dioxygenase from pseudomonas putida Kt2440", "Keywords": ["On-heme dioxygenase", " oxidoreductase"], "authors": ["G.q.liu", "H.z.tang"]}
|
||||
{"classification": "Ligase", "pdb": "1VCN", "deposition_date": "2004-03-10", "title": "Crystal structure of t.th. hb8 ctp synthetase complex with sulfate Anion", "Keywords": ["Etramer", " riken structural genomics/proteomics initiative", " rsgi", "", "Structural genomics", "Ligase"], "authors": ["M.goto", "Riken structural genomics/proteomics initiative (rsgi)"], "pmid": "15296735", "doi": "10.1016/J.STR.2004.05.013"}
|
||||
{"classification": "Transferase/transferase inhibitor", "pdb": "6C9V", "deposition_date": "2018-01-28", "title": "Mycobacterium tuberculosis adenosine kinase bound to (2r,3s,4r,5r)-2- (hydroxymethyl)-5-(6-(4-phenylpiperazin-1-yl)-9h-purin-9-yl) Tetrahydrofuran-3,4-diol", "Keywords": ["Ucleoside analog", " complex", " inhibitor", " structural genomics", " psi-2", "", "Protein structure initiative", "Tb structural genomics consortium", "", "Tbsgc", "Transferase-transferase inhibitor complex"], "authors": ["R.a.crespo", "Tb structural genomics consortium (tbsgc)"], "pmid": "31002508", "doi": "10.1021/ACS.JMEDCHEM.9B00020"}
|
||||
{"classification": "De novo protein", "pdb": "4LPY", "deposition_date": "2013-07-16", "title": "Crystal structure of tencon variant g10", "Keywords": ["Ibronectin type iii fold", " alternate scaffold", " de novo protein"], "authors": ["A.teplyakov", "G.obmolova", "G.l.gilliland"], "pmid": "24375666", "doi": "10.1002/PROT.24502"}
|
||||
{"classification": "Isomerase", "pdb": "2Y88", "deposition_date": "2011-02-03", "title": "Crystal structure of mycobacterium tuberculosis phosphoribosyl Isomerase (variant d11n) with bound prfar", "Keywords": ["Romatic amino acid biosynthesis", " isomerase", " tim-barrel", " histidine", "Biosynthesis", "Tryptophan biosynthesis"], "authors": ["J.kuper", "A.v.due", "A.geerlof", "M.wilmanns"], "pmid": "21321225", "doi": "10.1073/PNAS.1015996108"}
|
||||
{"classification": "Unknown function", "pdb": "1SR0", "deposition_date": "2004-03-22", "title": "Crystal structure of signalling protein from sheep(sps-40) at 3.0a Resolution using crystal grown in the presence of polysaccharides", "Keywords": ["Ignalling protein", " involution", " unknown function"], "authors": ["D.b.srivastava", "A.s.ethayathulla", "N.singh", "J.kumar", "S.sharma", "T.p.singh"]}
|
||||
{"classification": "Dna binding protein", "pdb": "3RH2", "deposition_date": "2011-04-11", "title": "Crystal structure of a tetr-like transcriptional regulator (sama_0099) From shewanella amazonensis sb2b at 2.42 a resolution", "Keywords": ["Na/rna-binding 3-helical bundle", " structural genomics", " joint center", "For structural genomics", "Jcsg", "Protein structure initiative", "Psi-", "Biology", "Dna binding protein"], "authors": ["Joint center for structural genomics (jcsg)"]}
|
||||
{"classification": "Transferase", "pdb": "2WK5", "deposition_date": "2009-06-05", "title": "Structural features of native human thymidine phosphorylase And in complex with 5-iodouracil", "Keywords": ["Lycosyltransferase", " developmental protein", " angiogenesis", "", "5-iodouracil", "Growth factor", "Enzyme kinetics", "", "Differentiation", "Disease mutation", "Thymidine", "Phosphorylase", "Chemotaxis", "Transferase", "Mutagenesis", "", "Polymorphism"], "authors": ["E.mitsiki", "A.c.papageorgiou", "S.iyer", "N.thiyagarajan", "S.h.prior", "", "D.sleep", "C.finnis", "K.r.acharya"], "pmid": "19555658", "doi": "10.1016/J.BBRC.2009.06.104"}
|
||||
{"classification": "Hydrolase", "pdb": "3P9Y", "deposition_date": "2010-10-18", "title": "Crystal structure of the drosophila melanogaster ssu72-pctd complex", "Keywords": ["Hosphatase", " cis proline", " lmw ptp-like fold", " rna polymerase ii ctd", "", "Hydrolase"], "authors": ["J.w.werner-allen", "P.zhou"], "pmid": "21159777", "doi": "10.1074/JBC.M110.197129"}
|
||||
{"classification": "Recombination/dna", "pdb": "6OEO", "deposition_date": "2019-03-27", "title": "Cryo-em structure of mouse rag1/2 nfc complex (dna1)", "Keywords": ["(d)j recombination", " dna transposition", " rag", " scid", " recombination", "", "Recombination-dna complex"], "authors": ["X.chen", "Y.cui", "Z.h.zhou", "W.yang", "M.gellert"], "pmid": "32015552", "doi": "10.1038/S41594-019-0363-2"}
|
||||
{"classification": "Hydrolase", "pdb": "4ECA", "deposition_date": "1997-02-21", "title": "Asparaginase from e. coli, mutant t89v with covalently bound aspartate", "Keywords": ["Ydrolase", " acyl-enzyme intermediate", " threonine amidohydrolase"], "authors": ["G.j.palm", "J.lubkowski", "A.wlodawer"], "pmid": "8706862", "doi": "10.1016/0014-5793(96)00660-6"}
|
||||
{"classification": "Transcription/protein binding", "pdb": "3UVX", "deposition_date": "2011-11-30", "title": "Crystal structure of the first bromodomain of human brd4 in complex With a diacetylated histone 4 peptide (h4k12ack16ac)", "Keywords": ["Romodomain", " bromodomain containing protein 4", " cap", " hunk1", " mcap", "", "Mitotic chromosome associated protein", "Peptide complex", "Structural", "Genomics consortium", "Sgc", "Transcription-protein binding complex"], "authors": ["P.filippakopoulos", "S.picaud", "T.keates", "E.ugochukwu", "F.von delft", "", "C.h.arrowsmith", "A.m.edwards", "J.weigelt", "C.bountra", "S.knapp", "Structural", "Genomics consortium (sgc)"], "pmid": "22464331", "doi": "10.1016/J.CELL.2012.02.013"}
|
||||
{"classification": "Membrane protein", "pdb": "1TLZ", "deposition_date": "2004-06-10", "title": "Tsx structure complexed with uridine", "Keywords": ["Ucleoside transporter", " beta barrel", " uridine", " membrane", "Protein"], "authors": ["J.ye", "B.van den berg"], "pmid": "15272310", "doi": "10.1038/SJ.EMBOJ.7600330"}
|
||||
{"classification": "Dna binding protein", "pdb": "7AZD", "deposition_date": "2020-11-16", "title": "Dna polymerase sliding clamp from escherichia coli with peptide 20 Bound", "Keywords": ["Ntibacterial drug", " dna binding protein"], "authors": ["C.monsarrat", "G.compain", "C.andre", "I.martiel", "S.engilberge", "V.olieric", "", "P.wolff", "K.brillet", "M.landolfo", "C.silva da veiga", "J.wagner", "G.guichard", "", "D.y.burnouf"], "pmid": "34806883", "doi": "10.1021/ACS.JMEDCHEM.1C00918"}
|
||||
{"classification": "Transferase", "pdb": "5N3K", "deposition_date": "2017-02-08", "title": "Camp-dependent protein kinase a from cricetulus griseus in complex With fragment like molecule o-guanidino-l-homoserine", "Keywords": ["Ragment", " complex", " transferase", " serine threonine kinase", " camp", "", "Kinase", "Pka"], "authors": ["C.siefker", "A.heine", "G.klebe"]}
|
||||
{"classification": "Biosynthetic protein", "pdb": "8H52", "deposition_date": "2022-10-11", "title": "Crystal structure of helicobacter pylori carboxyspermidine Dehydrogenase in complex with nadp", "Keywords": ["Arboxyspermidine dehydrogenase", " biosynthetic protein"], "authors": ["K.y.ko", "S.c.park", "S.y.cho", "S.i.yoon"], "pmid": "36283333", "doi": "10.1016/J.BBRC.2022.10.049"}
|
||||
{"classification": "Metal binding protein", "pdb": "6DYC", "deposition_date": "2018-07-01", "title": "Co(ii)-bound structure of the engineered cyt cb562 variant, ch3", "Keywords": ["Esigned protein", " 4-helix bundle", " electron transport", " metal binding", "Protein"], "authors": ["F.a.tezcan", "J.rittle"], "pmid": "30778140", "doi": "10.1038/S41557-019-0218-9"}
|
||||
{"classification": "Protein fibril", "pdb": "6A6B", "deposition_date": "2018-06-27", "title": "Cryo-em structure of alpha-synuclein fiber", "Keywords": ["Lpha-syn fiber", " parkinson disease", " protein fibril"], "authors": ["Y.w.li", "C.y.zhao", "F.luo", "Z.liu", "X.gui", "Z.luo", "X.zhang", "D.li", "C.liu", "X.li"], "pmid": "30065316", "doi": "10.1038/S41422-018-0075-X"}
|
||||
{"classification": "Dna", "pdb": "7D5E", "deposition_date": "2020-09-25", "title": "Left-handed g-quadruplex containing two bulges", "Keywords": ["-quadruplex", " bulge", " dna", " left-handed"], "authors": ["P.das", "A.maity", "K.h.ngo", "F.r.winnerdy", "B.bakalar", "Y.mechulam", "E.schmitt", "", "A.t.phan"], "pmid": "33503265", "doi": "10.1093/NAR/GKAA1259"}
|
||||
{"classification": "Transferase", "pdb": "3RSY", "deposition_date": "2011-05-02", "title": "Cellobiose phosphorylase from cellulomonas uda in complex with sulfate And glycerol", "Keywords": ["H94", " alpha barrel", " cellobiose phosphorylase", " disaccharide", "Phosphorylase", "Transferase"], "authors": ["A.van hoorebeke", "J.stout", "W.soetaert", "J.van beeumen", "T.desmet", "S.savvides"]}
|
||||
{"classification": "Oxidoreductase", "pdb": "7MCI", "deposition_date": "2021-04-02", "title": "Mofe protein from azotobacter vinelandii with a sulfur-replenished Cofactor", "Keywords": ["Zotobacter vinelandii", " mofe-protein", " nitrogenase", " oxidoreductase"], "authors": ["W.kang", "C.lee", "Y.hu", "M.w.ribbe"], "doi": "10.1038/S41929-022-00782-7"}
|
||||
{"classification": "Dna", "pdb": "1XUW", "deposition_date": "2004-10-26", "title": "Structural rationalization of a large difference in rna affinity Despite a small difference in chemistry between two 2'-o-modified Nucleic acid analogs", "Keywords": ["Na mimetic methylcarbamate amide analog", " dna"], "authors": ["R.pattanayek", "L.sethaphong", "C.pan", "M.prhavc", "T.p.prakash", "M.manoharan", "", "M.egli"], "pmid": "15547979", "doi": "10.1021/JA044637K"}
|
||||
{"classification": "Lyase", "pdb": "7C0D", "deposition_date": "2020-05-01", "title": "Crystal structure of azospirillum brasilense l-2-keto-3-deoxyarabonate Dehydratase (hydroxypyruvate-bound form)", "Keywords": ["-2-keto-3-deoxyarabonate dehydratase", " lyase"], "authors": ["Y.watanabe", "S.watanabe"], "pmid": "32697085", "doi": "10.1021/ACS.BIOCHEM.0C00515"}
|
||||
{"classification": "Signaling protein", "pdb": "5LYK", "deposition_date": "2016-09-28", "title": "Crystal structure of intracellular b30.2 domain of btn3a1 bound to Citrate", "Keywords": ["30.2", " butyrophilin", " signaling protein"], "authors": ["F.mohammed", "A.t.baker", "M.salim", "B.e.willcox"], "pmid": "28862425", "doi": "10.1021/ACSCHEMBIO.7B00694"}
|
||||
{"classification": "Toxin", "pdb": "4IZL", "deposition_date": "2013-01-30", "title": "Structure of the n248a mutant of the panton-valentine leucocidin s Component from staphylococcus aureus", "Keywords": ["I-component leucotoxin", " staphylococcus aureus", " s component", "Leucocidin", "Beta-barrel pore forming toxin", "Toxin"], "authors": ["L.maveyraud", "B.j.laventie", "G.prevost", "L.mourey"], "pmid": "24643034", "doi": "10.1371/JOURNAL.PONE.0092094"}
|
||||
{"classification": "Dna", "pdb": "6F3C", "deposition_date": "2017-11-28", "title": "The cytotoxic [pt(h2bapbpy)] platinum complex interacting with the Cgtacg hexamer", "Keywords": ["Rug-dna complex", " four-way junction", " dna"], "authors": ["M.ferraroni", "C.bazzicalupi", "P.gratteri", "F.papi"], "pmid": "31046177", "doi": "10.1002/ANIE.201814532"}
|
||||
{"classification": "Signaling protein/inhibitor", "pdb": "4L5M", "deposition_date": "2013-06-11", "title": "Complexe of arno sec7 domain with the protein-protein interaction Inhibitor n-(4-hydroxy-2,6-dimethylphenyl)benzenesulfonamide at ph6.5", "Keywords": ["Ec-7domain", " signaling protein-inhibitor complex"], "authors": ["F.hoh", "J.rouhana"], "pmid": "24112024", "doi": "10.1021/JM4009357"}
|
||||
{"classification": "Signaling protein", "pdb": "5I6J", "deposition_date": "2016-02-16", "title": "Crystal structure of srgap2 f-barx", "Keywords": ["Rgap2", " f-bar", " fx", " signaling protein"], "authors": ["M.sporny", "J.guez-haddad", "M.n.isupov", "Y.opatowsky"], "pmid": "28333212", "doi": "10.1093/MOLBEV/MSX094"}
|
||||
{"classification": "Metal binding protein", "pdb": "1Q80", "deposition_date": "2003-08-20", "title": "Solution structure and dynamics of nereis sarcoplasmic calcium binding Protein", "Keywords": ["Ll-alpha", " metal binding protein"], "authors": ["G.rabah", "R.popescu", "J.a.cox", "Y.engelborghs", "C.t.craescu"], "pmid": "15819893", "doi": "10.1111/J.1742-4658.2005.04629.X"}
|
||||
{"classification": "Transferase", "pdb": "1TW1", "deposition_date": "2004-06-30", "title": "Beta-1,4-galactosyltransferase mutant met344his (m344h-gal-t1) complex With udp-galactose and magnesium", "Keywords": ["Et344his mutation; closed conformation; mn binding", " transferase"], "authors": ["B.ramakrishnan", "E.boeggeman", "P.k.qasba"], "pmid": "15449940", "doi": "10.1021/BI049007+"}
|
||||
{"classification": "Rna", "pdb": "2PN4", "deposition_date": "2007-04-23", "title": "Crystal structure of hepatitis c virus ires subdomain iia", "Keywords": ["Cv", " ires", " subdoamin iia", " rna", " strontium", " hepatitis"], "authors": ["Q.zhao", "Q.han", "C.r.kissinger", "P.a.thompson"], "pmid": "18391410", "doi": "10.1107/S0907444908002011"}
|
||||
{"pdb": "1CW0", "title": "crystal structure analysis of very short patch repair (vsr) endonuclease in complex with a duplex dna", "authors": ["S.E.Tsutakawa", "H.Jingami", "K.Morikawa"], "doi": "10.1016/S0092-8674(00)81550-0", "pmid": "10612397"}
|
||||
{"pdb": "2CWW", "title": "crystal structure of thermus thermophilus ttha1280, a putative sam- dependent rna methyltransferase, in complex with s-adenosyl-l- homocysteine", "authors": ["A.A.Pioszak", "K.Murayama", "N.Nakagawa", "A.Ebihara", "S.Kuramitsu", "M.Shirouzu", "S.Yokoyama", "Riken Structural Genomics/proteomics Initiative (Rsgi)"], "doi": "10.1107/S1744309105029842", "pmid": "16511182"}
|
||||
{"pdb": "6CWE", "title": "structure of alpha-gsa[8,6p] bound by cd1d and in complex with the va14vb8.2 tcr", "authors": ["J.Wang", "D.Zajonc"], "doi": null, "pmid": null}
|
||||
{"pdb": "5CWS", "title": "crystal structure of the intact chaetomium thermophilum nsp1-nup49- nup57 channel nucleoporin heterotrimer bound to its nic96 nuclear pore complex attachment site", "authors": ["C.J.Bley", "S.Petrovic", "M.Paduch", "V.Lu", "A.A.Kossiakoff", "A.Hoelz"], "doi": "10.1126/SCIENCE.AAC9176", "pmid": "26316600"}
|
||||
{"pdb": "5CWE", "title": "structure of cyp107l2 from streptomyces avermitilis with lauric acid", "authors": ["T.-V.Pham", "S.-H.Han", "J.-H.Kim", "D.-H.Kim", "L.-W.Kang"], "doi": null, "pmid": null}
|
||||
{"pdb": "7CW4", "title": "acetyl-coa acetyltransferase from bacillus cereus atcc 14579", "authors": ["J.Hong", "K.J.Kim"], "doi": "10.1016/J.BBRC.2020.09.048", "pmid": "32972748"}
|
||||
{"pdb": "2CWP", "title": "crystal structure of metrs related protein from pyrococcus horikoshii", "authors": ["K.Murayama", "M.Kato-Murayama", "M.Shirouzu", "S.Yokoyama", "Riken StructuralGenomics/proteomics Initiative (Rsgi)"], "doi": null, "pmid": null}
|
||||
{"pdb": "2CW7", "title": "crystal structure of intein homing endonuclease ii", "authors": ["H.Matsumura", "H.Takahashi", "T.Inoue", "H.Hashimoto", "M.Nishioka", "S.Fujiwara", "M.Takagi", "T.Imanaka", "Y.Kai"], "doi": "10.1002/PROT.20858", "pmid": "16493661"}
|
||||
{"pdb": "1CWU", "title": "brassica napus enoyl acp reductase a138g mutant complexed with nad+ and thienodiazaborine", "authors": ["A.Roujeinikova", "J.B.Rafferty", "D.W.Rice"], "doi": "10.1074/JBC.274.43.30811", "pmid": "10521472"}
|
||||
{"pdb": "3CWN", "title": "escherichia coli transaldolase b mutant f178y", "authors": ["T.Sandalova", "G.Schneider", "A.Samland"], "doi": "10.1074/JBC.M803184200", "pmid": "18687684"}
|
||||
{"pdb": "1CWL", "title": "human cyclophilin a complexed with 4 4-hydroxy-meleu cyclosporin", "authors": ["V.Mikol", "J.Kallen", "P.Taylor", "M.D.Walkinshaw"], "doi": "10.1006/JMBI.1998.2108", "pmid": "9769216"}
|
||||
{"pdb": "3CW2", "title": "crystal structure of the intact archaeal translation initiation factor 2 from sulfolobus solfataricus .", "authors": ["E.A.Stolboushkina", "S.V.Nikonov", "A.D.Nikulin", "U.Blaesi", "D.J.Manstein", "R.V.Fedorov", "M.B.Garber", "O.S.Nikonov"], "doi": "10.1016/J.JMB.2008.07.039", "pmid": "18675278"}
|
||||
{"pdb": "3CW9", "title": "4-chlorobenzoyl-coa ligase/synthetase in the thioester-forming conformation, bound to 4-chlorophenacyl-coa", "authors": ["A.S.Reger", "J.Cao", "R.Wu", "D.Dunaway-Mariano", "A.M.Gulick"], "doi": "10.1021/BI800696Y", "pmid": "18620418"}
|
||||
{"pdb": "3CWU", "title": "crystal structure of an alka host/guest complex 2'-fluoro-2'-deoxy-1, n6-ethenoadenine:thymine base pair", "authors": ["B.R.Bowman", "S.Lee", "S.Wang", "G.L.Verdine"], "doi": "10.1016/J.STR.2008.04.012", "pmid": "18682218"}
|
||||
{"pdb": "5CWF", "title": "crystal structure of de novo designed helical repeat protein dhr8", "authors": ["G.Bhabha", "D.C.Ekiert"], "doi": "10.1038/NATURE16162", "pmid": "26675729"}
|
|
@ -1,36 +1,6 @@
|
|||
{"pid": " Q6GZX4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 43"}], "title": "Putative transcription factor 001R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZX3", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 45"}], "title": "Uncharacterized protein 002L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q197F8", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-02-23", "date_info": "entry version 29"}], "title": "Uncharacterized protein 002R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": " Q197F7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2020-08-12", "date_info": "entry version 23"}], "title": "Uncharacterized protein 003L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": " Q6GZX2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 3R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZX1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 38"}], "title": "Uncharacterized protein 004R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q197F5", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-10-12", "date_info": "entry version 32"}], "title": "Uncharacterized protein 005L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": " Q6GZX0", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 47"}], "title": "Uncharacterized protein 005R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q91G88", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2023-06-28", "date_info": "entry version 53"}], "title": "Putative KilA-N domain-containing protein 006L", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
|
||||
{"pid": " Q6GZW9", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 34"}], "title": "Uncharacterized protein 006R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZW8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 32"}], "title": "Uncharacterized protein 007R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q197F3", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 28"}], "title": "Uncharacterized protein 007R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": " Q197F2", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-02-23", "date_info": "entry version 22"}], "title": "Uncharacterized protein 008L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": " Q6GZW6", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 67"}], "title": "Putative helicase 009L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q91G85", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 38"}], "title": "Uncharacterized protein 009R", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
|
||||
{"pid": " Q6GZW5", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 010R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q197E9", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 28"}], "title": "Uncharacterized protein 011L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": " Q6GZW4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 011R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZW3", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 012L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q197E7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 37"}], "title": "Uncharacterized protein IIV3-013L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": " Q6GZW2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 30"}], "title": "Uncharacterized protein 013R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZW1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 014R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZW0", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 50"}], "title": "Uncharacterized protein 015R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZV8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 017L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZV7", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 33"}], "title": "Uncharacterized protein 018L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZV6", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 87"}], "title": "Putative serine/threonine-protein kinase 019R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZV5", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 40"}], "title": "Uncharacterized protein 020R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZV4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 021L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q197D8", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-12-14", "date_info": "entry version 35"}], "title": "Transmembrane protein 022L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": " Q6GZV2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 33"}], "title": "Uncharacterized protein 023R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q197D7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 25"}], "title": "Uncharacterized protein 023R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": " Q6GZV1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 024R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q197D5", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-10-12", "date_info": "entry version 24"}], "title": "Uncharacterized protein 025R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": " Q91G70", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2020-08-12", "date_info": "entry version 32"}], "title": "Uncharacterized protein 026R", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
|
||||
{"pid": " Q6GZU9", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 49"}], "title": "Uncharacterized protein 027R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": " Q6GZU8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 55"}], "title": "Uncharacterized protein 028R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": "Q6GZX4", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 41."}], "title": "Putative transcription factor 001R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": "Q6GZX3", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 42."}], "title": "Uncharacterized protein 002L;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": "Q197F8", "dates": [{"date": "16-JUN-2009", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "11-JUL-2006", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 27."}], "title": "Uncharacterized protein 002R;", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus."], "references": [{"PubMed": "16912294"}, {" DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": "Q197F7", "dates": [{"date": "16-JUN-2009", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "11-JUL-2006", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 23."}], "title": "Uncharacterized protein 003L;", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus."], "references": [{"PubMed": "16912294"}, {" DOI": "10.1128/jvi.00464-06"}]}
|
||||
{"pid": "Q6GZX2", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 36."}], "title": "Uncharacterized protein 3R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
|
||||
{"pid": "Q6GZX1", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 34."}], "title": "Uncharacterized protein 004R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
|
|
@ -14,12 +14,10 @@ import org.junit.jupiter.api.extension.ExtendWith
|
|||
import org.junit.jupiter.api.{BeforeEach, Test}
|
||||
import org.mockito.junit.jupiter.MockitoExtension
|
||||
|
||||
import java.io.{BufferedReader, InputStream, InputStreamReader}
|
||||
import java.io.{BufferedReader, FileInputStream, InputStream, InputStreamReader}
|
||||
import java.util.zip.GZIPInputStream
|
||||
import scala.collection.JavaConverters._
|
||||
import scala.collection.mutable.ListBuffer
|
||||
import scala.io.Source
|
||||
import scala.xml.pull.XMLEventReader
|
||||
|
||||
@ExtendWith(Array(classOf[MockitoExtension]))
|
||||
class BioScholixTest extends AbstractVocabularyTest {
|
||||
|
@ -49,11 +47,11 @@ class BioScholixTest extends AbstractVocabularyTest {
|
|||
|
||||
@Test
|
||||
def testEBIData() = {
|
||||
val inputXML = Source
|
||||
.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
|
||||
.mkString
|
||||
val xml = new XMLEventReader(Source.fromBytes(inputXML.getBytes()))
|
||||
new PMParser(xml).foreach(s => println(mapper.writeValueAsString(s)))
|
||||
val inputXML = getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml")
|
||||
|
||||
// new PubmedParser(new GZIPInputStream(new FileInputStream("/Users/sandro/Downloads/pubmed23n1078.xml.gz")))
|
||||
new PMParser(new GZIPInputStream(new FileInputStream("/Users/sandro/Downloads/pubmed23n1078.xml.gz")))
|
||||
print("DONE")
|
||||
}
|
||||
|
||||
@Test
|
||||
|
@ -89,14 +87,14 @@ class BioScholixTest extends AbstractVocabularyTest {
|
|||
|
||||
}
|
||||
|
||||
@Test
|
||||
def testParsingPubmedXML(): Unit = {
|
||||
val xml = new XMLEventReader(
|
||||
Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
|
||||
)
|
||||
val parser = new PMParser(xml)
|
||||
parser.foreach(checkPMArticle)
|
||||
}
|
||||
// @Test
|
||||
// def testParsingPubmedXML(): Unit = {
|
||||
// val xml = new XMLEventReader(
|
||||
// Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
|
||||
// )
|
||||
// val parser = new PMParser(xml)
|
||||
// parser.foreach(checkPMArticle)
|
||||
// }
|
||||
|
||||
private def checkPubmedPublication(o: Oaf): Unit = {
|
||||
assertTrue(o.isInstanceOf[Publication])
|
||||
|
@ -153,19 +151,19 @@ class BioScholixTest extends AbstractVocabularyTest {
|
|||
assertTrue(hasOldOpenAIREID)
|
||||
}
|
||||
|
||||
@Test
|
||||
def testPubmedMapping(): Unit = {
|
||||
|
||||
val xml = new XMLEventReader(
|
||||
Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
|
||||
)
|
||||
val parser = new PMParser(xml)
|
||||
val results = ListBuffer[Oaf]()
|
||||
parser.foreach(x => results += PubMedToOaf.convert(x, vocabularies))
|
||||
|
||||
results.foreach(checkPubmedPublication)
|
||||
|
||||
}
|
||||
// @Test
|
||||
// def testPubmedMapping(): Unit = {
|
||||
//
|
||||
// val xml = new XMLEventReader(
|
||||
// Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
|
||||
// )
|
||||
// val parser = new PMParser(xml)
|
||||
// val results = ListBuffer[Oaf]()
|
||||
// parser.foreach(x => results += PubMedToOaf.convert(x, vocabularies))
|
||||
//
|
||||
// results.foreach(checkPubmedPublication)
|
||||
//
|
||||
// }
|
||||
|
||||
@Test
|
||||
def testPDBToOAF(): Unit = {
|
||||
|
|
|
@ -2,9 +2,7 @@
|
|||
package eu.dnetlib.dhp.broker.oa.util;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.nio.charset.StandardCharsets;
|
||||
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.spark.sql.Row;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
@ -13,7 +11,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;
|
|||
|
||||
import eu.dnetlib.broker.objects.OaBrokerMainEntity;
|
||||
import eu.dnetlib.pace.config.DedupConfig;
|
||||
import eu.dnetlib.pace.model.SparkDeduper;
|
||||
import eu.dnetlib.pace.model.SparkDedupConfig;
|
||||
import eu.dnetlib.pace.tree.support.TreeProcessor;
|
||||
|
||||
public class TrustUtils {
|
||||
|
@ -22,22 +20,18 @@ public class TrustUtils {
|
|||
|
||||
private static DedupConfig dedupConfig;
|
||||
|
||||
private static SparkDeduper deduper;
|
||||
private static SparkDedupConfig sparkDedupConfig;
|
||||
|
||||
private static final ObjectMapper mapper;
|
||||
|
||||
static {
|
||||
mapper = new ObjectMapper();
|
||||
try {
|
||||
dedupConfig = DedupConfig
|
||||
.load(
|
||||
IOUtils
|
||||
.toString(
|
||||
DedupConfig.class
|
||||
.getResourceAsStream("/eu/dnetlib/dhp/broker/oa/dedupConfig/dedupConfig.json"),
|
||||
StandardCharsets.UTF_8));
|
||||
|
||||
deduper = new SparkDeduper(dedupConfig);
|
||||
dedupConfig = mapper
|
||||
.readValue(
|
||||
DedupConfig.class.getResourceAsStream("/eu/dnetlib/dhp/broker/oa/dedupConfig/dedupConfig.json"),
|
||||
DedupConfig.class);
|
||||
sparkDedupConfig = new SparkDedupConfig(dedupConfig, 1);
|
||||
} catch (final IOException e) {
|
||||
log.error("Error loading dedupConfig", e);
|
||||
}
|
||||
|
@ -53,8 +47,8 @@ public class TrustUtils {
|
|||
}
|
||||
|
||||
try {
|
||||
final Row doc1 = deduper.model().rowFromJson(mapper.writeValueAsString(r1));
|
||||
final Row doc2 = deduper.model().rowFromJson(mapper.writeValueAsString(r2));
|
||||
final Row doc1 = sparkDedupConfig.rowFromJson(mapper.writeValueAsString(r1));
|
||||
final Row doc2 = sparkDedupConfig.rowFromJson(mapper.writeValueAsString(r2));
|
||||
|
||||
final double score = new TreeProcessor(dedupConfig).computeScore(doc1, doc2);
|
||||
|
||||
|
@ -63,7 +57,7 @@ public class TrustUtils {
|
|||
return TrustUtils.rescale(score, threshold);
|
||||
} catch (final Exception e) {
|
||||
log.error("Error computing score between results", e);
|
||||
throw new RuntimeException(e);
|
||||
return BrokerConstants.MIN_TRUST;
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
@ -83,7 +83,7 @@ public class SimpleVariableJobTest {
|
|||
|
||||
final long n = spark
|
||||
.createDataset(inputList, Encoders.STRING())
|
||||
.filter((FilterFunction<String>) s -> filter(map.get(s)))
|
||||
.filter((FilterFunction<String>) s -> filter(map.get(s)))
|
||||
.map((MapFunction<String, String>) String::toLowerCase, Encoders.STRING())
|
||||
.count();
|
||||
|
||||
|
|
|
@ -41,18 +41,54 @@
|
|||
|
||||
</build>
|
||||
|
||||
<dependencyManagement>
|
||||
<dependencies>
|
||||
<dependency>
|
||||
<groupId>io.opentelemetry</groupId>
|
||||
<artifactId>opentelemetry-bom</artifactId>
|
||||
<version>1.16.0</version>
|
||||
<type>pom</type>
|
||||
<scope>import</scope>
|
||||
</dependency>
|
||||
</dependencies>
|
||||
</dependencyManagement>
|
||||
|
||||
<dependencies>
|
||||
|
||||
<dependency>
|
||||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
<artifactId>dhp-common</artifactId>
|
||||
<version>${project.version}</version>
|
||||
<exclusions>
|
||||
<exclusion>
|
||||
<artifactId>log4j</artifactId>
|
||||
<groupId>log4j</groupId>
|
||||
</exclusion>
|
||||
<exclusion>
|
||||
<artifactId>annotations</artifactId>
|
||||
<groupId>org.jetbrains</groupId>
|
||||
</exclusion>
|
||||
<exclusion>
|
||||
<artifactId>slf4j-api</artifactId>
|
||||
<groupId>org.slf4j</groupId>
|
||||
</exclusion>
|
||||
</exclusions>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>eu.dnetlib.dhp</groupId>
|
||||
<artifactId>dhp-pace-core</artifactId>
|
||||
<version>${project.version}</version>
|
||||
<exclusions>
|
||||
<exclusion>
|
||||
<artifactId>jsr305</artifactId>
|
||||
<groupId>com.google.code.findbugs</groupId>
|
||||
</exclusion>
|
||||
<exclusion>
|
||||
<artifactId>javassist</artifactId>
|
||||
<groupId>org.javassist</groupId>
|
||||
</exclusion>
|
||||
</exclusions>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
|
@ -90,14 +126,17 @@
|
|||
<groupId>com.arakelian</groupId>
|
||||
<artifactId>java-jq</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>dom4j</groupId>
|
||||
<artifactId>dom4j</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>jaxen</groupId>
|
||||
<artifactId>jaxen</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>com.fasterxml.jackson.core</groupId>
|
||||
<artifactId>jackson-databind</artifactId>
|
||||
|
@ -116,6 +155,7 @@
|
|||
<version>1.4.200</version>
|
||||
<scope>test</scope>
|
||||
</dependency>
|
||||
|
||||
</dependencies>
|
||||
|
||||
|
||||
|
|
|
@ -0,0 +1,103 @@
|
|||
package eu.dnetlib.dhp.oa.dedup
|
||||
|
||||
import eu.dnetlib.dhp.application.ArgumentApplicationParser
|
||||
import eu.dnetlib.dhp.oa.dedup.dsl.{Clustering, Deduper}
|
||||
import eu.dnetlib.dhp.oa.dedup.model.BlockStats
|
||||
import eu.dnetlib.dhp.utils.ISLookupClientFactory
|
||||
import eu.dnetlib.enabling.is.lookup.rmi.{ISLookUpException, ISLookUpService}
|
||||
import eu.dnetlib.pace.model.{RowDataOrderingComparator, SparkDedupConfig}
|
||||
import org.apache.commons.io.IOUtils
|
||||
import org.apache.spark.SparkConf
|
||||
import org.apache.spark.sql._
|
||||
import org.apache.spark.sql.catalyst.expressions.Literal
|
||||
import org.apache.spark.sql.types.DataTypes
|
||||
import org.dom4j.DocumentException
|
||||
import org.slf4j.LoggerFactory
|
||||
import org.xml.sax.SAXException
|
||||
|
||||
import java.io.IOException
|
||||
import java.util.stream.Collectors
|
||||
|
||||
object DSLExample {
|
||||
private val log = LoggerFactory.getLogger(classOf[DSLExample])
|
||||
|
||||
@throws[Exception]
|
||||
def main(args: Array[String]): Unit = {
|
||||
val parser = new ArgumentApplicationParser(
|
||||
IOUtils
|
||||
.toString(classOf[DSLExample].getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/createBlockStats_parameters.json"))
|
||||
)
|
||||
parser.parseArgument(args)
|
||||
val conf = new SparkConf
|
||||
new DSLExample(parser, AbstractSparkAction.getSparkSession(conf)).run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")))
|
||||
}
|
||||
}
|
||||
|
||||
class DSLExample(parser: ArgumentApplicationParser, spark: SparkSession) extends AbstractSparkAction(parser, spark) {
|
||||
|
||||
def computeComparisons(blockSize: Long, slidingWindowSize: Long): Long =
|
||||
if (slidingWindowSize >= blockSize) (slidingWindowSize * (slidingWindowSize - 1)) / 2
|
||||
else (blockSize - slidingWindowSize + 1) * (slidingWindowSize * (slidingWindowSize - 1)) / 2
|
||||
|
||||
@throws[DocumentException]
|
||||
@throws[IOException]
|
||||
@throws[ISLookUpException]
|
||||
@throws[SAXException]
|
||||
override def run(isLookUpService: ISLookUpService): Unit = {
|
||||
// read oozie parameters
|
||||
val graphBasePath = parser.get("graphBasePath")
|
||||
val isLookUpUrl = parser.get("isLookUpUrl")
|
||||
val actionSetId = parser.get("actionSetId")
|
||||
val workingPath = parser.get("workingPath")
|
||||
val numPartitions : Int = Option(parser.get("numPartitions")).map(_.toInt).getOrElse(AbstractSparkAction.NUM_PARTITIONS)
|
||||
|
||||
DSLExample.log.info("graphBasePath: '{}'", graphBasePath)
|
||||
DSLExample.log.info("isLookUpUrl: '{}'", isLookUpUrl)
|
||||
DSLExample.log.info("actionSetId: '{}'", actionSetId)
|
||||
DSLExample.log.info("workingPath: '{}'", workingPath)
|
||||
// for each dedup configuration
|
||||
import scala.collection.JavaConversions._
|
||||
for (dedupConf <- getConfigurations(isLookUpService, actionSetId).subList(0, 1)) {
|
||||
val subEntity = dedupConf.getWf.getSubEntityValue
|
||||
DSLExample.log.info("Creating blockstats for: '{}'", subEntity)
|
||||
val outputPath = DedupUtility.createBlockStatsPath(workingPath, actionSetId, subEntity)
|
||||
AbstractSparkAction.removeOutputDir(spark, outputPath)
|
||||
|
||||
val sparkConfig = SparkDedupConfig(dedupConf, numPartitions)
|
||||
|
||||
val inputDF = spark.read
|
||||
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
|
||||
.transform(sparkConfig.modelExtractor)
|
||||
val simRels = inputDF
|
||||
.transform(sparkConfig.generateClusters)
|
||||
.filter(functions.size(new Column("block")).geq(new Literal(1, DataTypes.IntegerType)))
|
||||
|
||||
val deduper = Deduper(inputDF.schema)
|
||||
.withClustering( Clustering("sortedngrampairs"),
|
||||
Clustering("sortedngrampairs", Seq("legalname"), Map("max" -> 2, "ngramLen" -> 3)),
|
||||
Clustering("suffixprefix", Seq("legalname"), Map("max" -> 1, "len" -> 3)),
|
||||
Clustering("urlclustering", Seq("websiteurl")),
|
||||
Clustering("keywordsclustering", Seq("fields"), Map("max" -> 2, "windowSize" -> 4))
|
||||
)
|
||||
|
||||
simRels
|
||||
.map[BlockStats](
|
||||
(b:Row) => {
|
||||
val documents = b.getList(1)
|
||||
val mapDocuments = documents.stream
|
||||
.sorted(new RowDataOrderingComparator(sparkConfig.orderingFieldPosition))
|
||||
.limit(dedupConf.getWf.getQueueMaxSize)
|
||||
.collect(Collectors.toList)
|
||||
new BlockStats(
|
||||
b.getString(0),
|
||||
mapDocuments.size.toLong,
|
||||
computeComparisons(mapDocuments.size.toLong, dedupConf.getWf.getSlidingWindowSize.toLong)
|
||||
)
|
||||
|
||||
})(Encoders.bean[BlockStats](classOf[BlockStats]))
|
||||
.write
|
||||
.mode(SaveMode.Overwrite)
|
||||
.save(outputPath)
|
||||
}
|
||||
}
|
||||
}
|
|
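The `computeComparisons` helper in `DSLExample` above estimates how many pairwise comparisons a sliding window performs over a block. The following is a minimal Java sketch of the same arithmetic, transcribed from the Scala expression. The class name `ComparisonsSketch` is hypothetical and is not part of the repository.

```java
public class ComparisonsSketch {

    // Mirrors the Scala expression in DSLExample.computeComparisons.
    public static long computeComparisons(long blockSize, long slidingWindowSize) {
        // Ordered pairs inside one window; halved below to count unordered pairs.
        long pairsPerWindow = slidingWindowSize * (slidingWindowSize - 1);
        if (slidingWindowSize >= blockSize) {
            // The source uses the window size here, not the block size.
            return pairsPerWindow / 2;
        }
        // Number of window positions times the pairs in each window.
        return (blockSize - slidingWindowSize + 1) * pairsPerWindow / 2;
    }

    public static void main(String[] args) {
        System.out.println(computeComparisons(10, 3)); // prints 24
        System.out.println(computeComparisons(2, 3));  // prints 3
    }
}
```

Note that when the window is at least as large as the block, the expression depends only on the window size, so a block of 2 documents with a window of 3 is counted as 3 comparisons.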
@ -110,10 +110,6 @@ public class DedupRecordFactory {
|
|||
|
||||
// set authors and date
|
||||
if (ModelSupport.isSubClass(entity, Result.class)) {
|
||||
Optional
|
||||
.ofNullable(((Result) entity).getAuthor())
|
||||
.ifPresent(a -> authors.add(a));
|
||||
|
||||
((Result) entity).setAuthor(AuthorMerger.merge(authors));
|
||||
}
|
||||
|
||||
|
|
|
@ -3,8 +3,12 @@ package eu.dnetlib.dhp.oa.dedup;
|
|||
|
||||
import java.io.StringReader;
|
||||
import java.util.ArrayList;
|
||||
import java.util.HashMap;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
import org.apache.spark.SparkContext;
|
||||
import org.apache.spark.util.LongAccumulator;
|
||||
import org.dom4j.Document;
|
||||
import org.dom4j.DocumentException;
|
||||
import org.dom4j.Element;
|
||||
|
|
|
@ -3,13 +3,17 @@ package eu.dnetlib.dhp.oa.dedup;
|
|||
|
||||
import java.io.IOException;
|
||||
import java.util.Collection;
|
||||
import java.util.List;
|
||||
import java.util.Optional;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.spark.SparkConf;
|
||||
import org.apache.spark.api.java.JavaSparkContext;
|
||||
import org.apache.spark.api.java.function.MapFunction;
|
||||
import org.apache.spark.sql.*;
|
||||
import org.apache.spark.sql.catalyst.expressions.Literal;
|
||||
import org.apache.spark.sql.types.DataTypes;
|
||||
import org.dom4j.DocumentException;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
@ -21,7 +25,8 @@ import eu.dnetlib.dhp.utils.ISLookupClientFactory;
|
|||
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
|
||||
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
|
||||
import eu.dnetlib.pace.config.DedupConfig;
|
||||
import eu.dnetlib.pace.model.SparkDeduper;
|
||||
import eu.dnetlib.pace.model.RowDataOrderingComparator;
|
||||
import eu.dnetlib.pace.model.SparkDedupConfig;
|
||||
|
||||
public class SparkBlockStats extends AbstractSparkAction {
|
||||
|
||||
|
@ -85,28 +90,27 @@ public class SparkBlockStats extends AbstractSparkAction {
|
|||
|
||||
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
|
||||
|
||||
SparkDeduper deduper = new SparkDeduper(dedupConf);
|
||||
SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);
|
||||
|
||||
Dataset<Row> simRels = spark
|
||||
Dataset<Row> inputDF = spark
|
||||
.read()
|
||||
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
|
||||
.transform(deduper.model().parseJsonDataset())
|
||||
.transform(deduper.filterAndCleanup())
|
||||
.transform(deduper.generateClustersWithCollect())
|
||||
.filter(functions.size(new Column("block")).geq(1));
|
||||
.transform(sparkConfig.modelExtractor());
|
||||
Dataset<Row> simRels = inputDF
|
||||
.transform(sparkConfig.generateClusters())
|
||||
.filter(functions.size(new Column("block")).geq(new Literal(1, DataTypes.IntegerType)));
|
||||
|
||||
simRels.map((MapFunction<Row, BlockStats>) row -> {
|
||||
Collection<Row> mapDocuments = row.getList(row.fieldIndex("block"));
|
||||
simRels.map((MapFunction<Row, BlockStats>) b -> {
|
||||
Collection<Row> documents = b.getList(1);
|
||||
|
||||
/*
|
||||
* List<Row> mapDocuments = documents .stream() .sorted( new
|
||||
* RowDataOrderingComparator(deduper.model().orderingFieldPosition(),
|
||||
* deduper.model().identityFieldPosition())) .limit(dedupConf.getWf().getQueueMaxSize())
|
||||
* .collect(Collectors.toList());
|
||||
*/
|
||||
List<Row> mapDocuments = documents
|
||||
.stream()
|
||||
.sorted(new RowDataOrderingComparator(sparkConfig.orderingFieldPosition()))
|
||||
.limit(dedupConf.getWf().getQueueMaxSize())
|
||||
.collect(Collectors.toList());
|
||||
|
||||
return new BlockStats(
|
||||
row.getString(row.fieldIndex("key")),
|
||||
b.getString(0),
|
||||
(long) mapDocuments.size(),
|
||||
computeComparisons(
|
||||
(long) mapDocuments.size(), (long) dedupConf.getWf().getSlidingWindowSize()));
|
||||
|
|
|
@ -1,78 +0,0 @@
|
|||
package eu.dnetlib.dhp.oa.dedup
|
||||
|
||||
import eu.dnetlib.dhp.application.ArgumentApplicationParser
|
||||
import eu.dnetlib.dhp.common.HdfsSupport
|
||||
import eu.dnetlib.dhp.schema.oaf.Relation
|
||||
import eu.dnetlib.dhp.utils.ISLookupClientFactory
|
||||
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService
|
||||
import org.apache.commons.io.IOUtils
|
||||
import org.apache.spark.SparkConf
|
||||
import org.apache.spark.sql._
|
||||
import org.apache.spark.sql.functions.col
|
||||
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
|
||||
import org.slf4j.LoggerFactory
|
||||
|
||||
object SparkCleanRelation {
|
||||
private val log = LoggerFactory.getLogger(classOf[SparkCleanRelation])
|
||||
|
||||
@throws[Exception]
|
||||
def main(args: Array[String]): Unit = {
|
||||
val parser = new ArgumentApplicationParser(
|
||||
IOUtils.toString(
|
||||
classOf[SparkCleanRelation].getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/cleanRelation_parameters.json")
|
||||
)
|
||||
)
|
||||
parser.parseArgument(args)
|
||||
val conf = new SparkConf
|
||||
|
||||
new SparkCleanRelation(parser, AbstractSparkAction.getSparkSession(conf))
|
||||
.run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")))
|
||||
}
|
||||
}
|
||||
|
||||
class SparkCleanRelation(parser: ArgumentApplicationParser, spark: SparkSession)
|
||||
extends AbstractSparkAction(parser, spark) {
|
||||
override def run(isLookUpService: ISLookUpService): Unit = {
|
||||
val graphBasePath = parser.get("graphBasePath")
|
||||
val inputPath = parser.get("inputPath")
|
||||
val outputPath = parser.get("outputPath")
|
||||
|
||||
SparkCleanRelation.log.info("graphBasePath: '{}'", graphBasePath)
|
||||
SparkCleanRelation.log.info("inputPath: '{}'", inputPath)
|
||||
SparkCleanRelation.log.info("outputPath: '{}'", outputPath)
|
||||
|
||||
AbstractSparkAction.removeOutputDir(spark, outputPath)
|
||||
|
||||
val entities =
|
||||
Seq("datasource", "project", "organization", "publication", "dataset", "software", "otherresearchproduct")
|
||||
|
||||
val idsSchema = StructType.fromDDL("`id` STRING, `dataInfo` STRUCT<`deletedbyinference`:BOOLEAN,`invisible`:BOOLEAN>")
|
||||
|
||||
val emptyIds = spark.createDataFrame(spark.sparkContext.emptyRDD[Row].setName("empty"),
|
||||
idsSchema)
|
||||
|
||||
val ids = entities
|
||||
.foldLeft(emptyIds)((ds, entity) => {
|
||||
val entityPath = graphBasePath + '/' + entity
|
||||
if (HdfsSupport.exists(entityPath, spark.sparkContext.hadoopConfiguration)) {
|
||||
ds.union(spark.read.schema(idsSchema).json(entityPath))
|
||||
} else {
|
||||
ds
|
||||
}
|
||||
})
|
||||
.filter("dataInfo.deletedbyinference != true AND dataInfo.invisible != true")
|
||||
.select("id")
|
||||
.distinct()
|
||||
|
||||
val relations = spark.read.schema(Encoders.bean(classOf[Relation]).schema).json(inputPath)
|
||||
.filter("dataInfo.deletedbyinference != true AND dataInfo.invisible != true")
|
||||
|
||||
AbstractSparkAction.save(
|
||||
relations
|
||||
.join(ids, col("source") === ids("id"), "leftsemi")
|
||||
.join(ids, col("target") === ids("id"), "leftsemi"),
|
||||
outputPath,
|
||||
SaveMode.Overwrite
|
||||
)
|
||||
}
|
||||
}
|
|
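The two chained `leftsemi` joins in `SparkCleanRelation` above keep a relation only when both its `source` and its `target` appear among the ids of entities that are neither deleted by inference nor invisible. A plain-Java sketch of that filtering effect follows, without Spark. The `CleanRelationSketch` class and its `Rel` type are hypothetical stand-ins that carry only the two endpoints.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class CleanRelationSketch {

    // Hypothetical stand-in for the Relation bean: endpoints only.
    static final class Rel {
        final String source;
        final String target;

        Rel(String source, String target) {
            this.source = source;
            this.target = target;
        }
    }

    // Keeps a relation only if both endpoints are in the set of visible ids,
    // which is the combined effect of the two left-semi joins.
    public static List<Rel> keepResolvable(List<Rel> relations, Set<String> visibleIds) {
        return relations.stream()
            .filter(r -> visibleIds.contains(r.source) && visibleIds.contains(r.target))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> ids = new HashSet<>(Arrays.asList("a", "b"));
        List<Rel> rels = Arrays.asList(new Rel("a", "b"), new Rel("a", "x"), new Rel("x", "b"));
        System.out.println(keepResolvable(rels, ids).size()); // prints 1
    }
}
```

A semi join never duplicates rows from the left side, which is why a simple membership test is an adequate model here.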
@ -23,7 +23,7 @@ import eu.dnetlib.dhp.utils.ISLookupClientFactory;
|
|||
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
|
||||
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
|
||||
import eu.dnetlib.pace.config.DedupConfig;
|
||||
import eu.dnetlib.pace.model.SparkDeduper;
|
||||
import eu.dnetlib.pace.model.SparkDedupConfig;
|
||||
|
||||
public class SparkCreateSimRels extends AbstractSparkAction {
|
||||
|
||||
|
@ -84,14 +84,20 @@ public class SparkCreateSimRels extends AbstractSparkAction {
|
|||
|
||||
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
|
||||
|
||||
SparkDeduper deduper = new SparkDeduper(dedupConf);
|
||||
SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);
|
||||
|
||||
spark.udf().register("collect_sort_slice", sparkConfig.collectSortSliceUDAF());
|
||||
|
||||
Dataset<?> simRels = spark
|
||||
.read()
|
||||
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
|
||||
.transform(deduper.model().parseJsonDataset())
|
||||
.transform(deduper.dedup())
|
||||
.distinct()
|
||||
.transform(sparkConfig.modelExtractor()) // Extract fields from input json column according to model
|
||||
// definition
|
||||
.transform(sparkConfig.generateClustersWithWindows()) // generate <key,block> pairs according to
|
||||
// filters, clusters, and model
|
||||
// definition
|
||||
.transform(sparkConfig.processClusters()) // process blocks and emits <from,to> pairs of found
|
||||
// similarities
|
||||
.map(
|
||||
(MapFunction<Row, Relation>) t -> DedupUtility
|
||||
.createSimRel(t.getStruct(0).getString(0), t.getStruct(0).getString(1), entity),
|
||||
|
|
|
@ -3,18 +3,13 @@ package eu.dnetlib.dhp.oa.dedup;
|
|||
|
||||
import static org.apache.spark.sql.functions.col;
|
||||
|
||||
import java.util.Arrays;
|
||||
import java.util.Collections;
|
||||
import java.util.Iterator;
|
||||
import java.util.Objects;
|
||||
|
||||
import org.apache.commons.beanutils.BeanUtils;
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
import org.apache.spark.SparkConf;
|
||||
import org.apache.spark.api.java.function.FilterFunction;
|
||||
import org.apache.spark.api.java.function.MapFunction;
|
||||
import org.apache.spark.api.java.function.ReduceFunction;
|
||||
import org.apache.spark.sql.*;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
@ -33,9 +28,9 @@ public class SparkPropagateRelation extends AbstractSparkAction {
|
|||
|
||||
private static final Logger log = LoggerFactory.getLogger(SparkPropagateRelation.class);
|
||||
|
||||
private static Encoder<Relation> REL_BEAN_ENC = Encoders.bean(Relation.class);
|
||||
|
||||
private static Encoder<Relation> REL_KRYO_ENC = Encoders.kryo(Relation.class);
|
||||
enum FieldType {
|
||||
SOURCE, TARGET
|
||||
}
|
||||
|
||||
public SparkPropagateRelation(ArgumentApplicationParser parser, SparkSession spark) {
|
||||
super(parser, spark);
|
||||
|
@ -76,62 +71,38 @@ public class SparkPropagateRelation extends AbstractSparkAction {
|
|||
Dataset<Relation> mergeRels = spark
|
||||
.read()
|
||||
.load(DedupUtility.createMergeRelPath(workingPath, "*", "*"))
|
||||
.as(REL_BEAN_ENC);
|
||||
.as(Encoders.bean(Relation.class));
|
||||
|
||||
// <mergedObjectID, dedupID>
|
||||
Dataset<Row> mergedIds = mergeRels
|
||||
Dataset<Tuple2<String, String>> mergedIds = mergeRels
|
||||
.where(col("relClass").equalTo(ModelConstants.MERGES))
|
||||
.select(col("source").as("dedupID"), col("target").as("mergedObjectID"))
|
||||
.select(col("source"), col("target"))
|
||||
.distinct()
|
||||
.map(
|
||||
(MapFunction<Row, Tuple2<String, String>>) r -> new Tuple2<>(r.getString(1), r.getString(0)),
|
||||
Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
|
||||
.cache();
|
||||
|
||||
Dataset<Row> allRels = spark
|
||||
.read()
|
||||
.schema(REL_BEAN_ENC.schema())
|
||||
.json(DedupUtility.createEntityPath(graphBasePath, "relation"));
|
||||
final String relationPath = DedupUtility.createEntityPath(graphBasePath, "relation");
|
||||
|
||||
Dataset<Relation> dedupedRels = allRels
|
||||
.joinWith(mergedIds, allRels.col("source").equalTo(mergedIds.col("mergedObjectID")), "left_outer")
|
||||
.joinWith(mergedIds, col("_1.target").equalTo(mergedIds.col("mergedObjectID")), "left_outer")
|
||||
.select("_1._1", "_1._2.dedupID", "_2.dedupID")
|
||||
.as(Encoders.tuple(REL_BEAN_ENC, Encoders.STRING(), Encoders.STRING()))
|
||||
.flatMap(SparkPropagateRelation::addInferredRelations, REL_KRYO_ENC);
|
||||
Dataset<Relation> rels = spark.read().textFile(relationPath).map(patchRelFn(), Encoders.bean(Relation.class));
|
||||
|
||||
Dataset<Relation> processedRelations = distinctRelations(
|
||||
dedupedRels.union(mergeRels.map((MapFunction<Relation, Relation>) r -> r, REL_KRYO_ENC)))
|
||||
.filter((FilterFunction<Relation>) r -> !Objects.equals(r.getSource(), r.getTarget()));
|
||||
Dataset<Relation> newRels = createNewRels(rels, mergedIds, getFixRelFn());
|
||||
|
||||
save(processedRelations, outputRelationPath, SaveMode.Overwrite);
|
||||
}
|
||||
Dataset<Relation> updated = processDataset(
|
||||
processDataset(rels, mergedIds, FieldType.SOURCE, getDeletedFn()),
|
||||
mergedIds,
|
||||
FieldType.TARGET,
|
||||
getDeletedFn());
|
||||
|
||||
private static Iterator<Relation> addInferredRelations(Tuple3<Relation, String, String> t) throws Exception {
|
||||
Relation existingRel = t._1();
|
||||
String newSource = t._2();
|
||||
String newTarget = t._3();
|
||||
|
||||
if (newSource == null && newTarget == null) {
|
||||
return Collections.singleton(t._1()).iterator();
|
||||
}
|
||||
|
||||
// update existing relation
|
||||
if (existingRel.getDataInfo() == null) {
|
||||
existingRel.setDataInfo(new DataInfo());
|
||||
}
|
||||
existingRel.getDataInfo().setDeletedbyinference(true);
|
||||
|
||||
// Create new relation inferred by dedupIDs
|
||||
Relation inferredRel = (Relation) BeanUtils.cloneBean(existingRel);
|
||||
|
||||
inferredRel.setDataInfo((DataInfo) BeanUtils.cloneBean(existingRel.getDataInfo()));
|
||||
inferredRel.getDataInfo().setDeletedbyinference(false);
|
||||
|
||||
if (newSource != null)
|
||||
inferredRel.setSource(newSource);
|
||||
|
||||
if (newTarget != null)
|
||||
inferredRel.setTarget(newTarget);
|
||||
|
||||
return Arrays.asList(existingRel, inferredRel).iterator();
|
||||
save(
|
||||
distinctRelations(
|
||||
newRels
|
||||
.union(updated)
|
||||
.union(mergeRels)
|
||||
.map((MapFunction<Relation, Relation>) r -> r, Encoders.kryo(Relation.class)))
|
||||
.filter((FilterFunction<Relation>) r -> !Objects.equals(r.getSource(), r.getTarget())),
|
||||
outputRelationPath, SaveMode.Overwrite);
|
||||
}
|
||||
|
||||
private Dataset<Relation> distinctRelations(Dataset<Relation> rels) {
|
||||
|
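The `addInferredRelations` variant shown above handles each relation together with the dedup ids found for its source and target. If neither endpoint was merged, the relation passes through unchanged. Otherwise the existing relation is flagged as deleted by inference and a copy redirected to the dedup ids is emitted next to it. A simplified sketch of that rule follows. The `InferredRelationSketch` class and its `Rel` type are hypothetical, and the sketch builds new immutable objects instead of mutating and cloning beans as the original does.

```java
import java.util.ArrayList;
import java.util.List;

public class InferredRelationSketch {

    // Hypothetical, simplified stand-in for Relation plus its DataInfo flag.
    static final class Rel {
        final String source;
        final String target;
        final boolean deletedByInference;

        Rel(String source, String target, boolean deletedByInference) {
            this.source = source;
            this.target = target;
            this.deletedByInference = deletedByInference;
        }
    }

    // newSource and newTarget are the dedup ids of the endpoints, or null if not merged.
    public static List<Rel> addInferredRelations(Rel existing, String newSource, String newTarget) {
        List<Rel> out = new ArrayList<>();
        if (newSource == null && newTarget == null) {
            out.add(existing); // neither endpoint was merged: keep as-is
            return out;
        }
        // The original relation is kept but flagged as deleted by inference.
        out.add(new Rel(existing.source, existing.target, true));
        // The inferred relation points at the dedup ids where available.
        out.add(new Rel(
            newSource != null ? newSource : existing.source,
            newTarget != null ? newTarget : existing.target,
            false));
        return out;
    }

    public static void main(String[] args) {
        List<Rel> out = addInferredRelations(new Rel("p1", "o1", false), "dedup1", null);
        System.out.println(out.size()); // prints 2
    }
}
```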
@ -139,13 +110,54 @@ public class SparkPropagateRelation extends AbstractSparkAction {
|
|||
.filter(getRelationFilterFunction())
|
||||
.groupByKey(
|
||||
(MapFunction<Relation, String>) r -> String
|
||||
.join(" ", r.getSource(), r.getTarget(), r.getRelType(), r.getSubRelType(), r.getRelClass()),
|
||||
.join(r.getSource(), r.getTarget(), r.getRelType(), r.getSubRelType(), r.getRelClass()),
|
||||
Encoders.STRING())
|
||||
.reduceGroups((ReduceFunction<Relation>) (b, a) -> {
|
||||
b.mergeFrom(a);
|
||||
return b;
|
||||
})
|
||||
.map((MapFunction<Tuple2<String, Relation>, Relation>) Tuple2::_2, REL_BEAN_ENC);
|
||||
.agg(new RelationAggregator().toColumn())
|
||||
.map((MapFunction<Tuple2<String, Relation>, Relation>) Tuple2::_2, Encoders.bean(Relation.class));
|
||||
}
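The two `groupByKey` variants in `distinctRelations` above differ only in the first argument passed to `String.join`. Because `String.join(CharSequence delimiter, CharSequence... elements)` treats its first argument as the delimiter, the variant without `" "` uses the relation source as the separator between the remaining fields. The sketch below shows the resulting keys. `GroupKeySketch` is a hypothetical class, and the key is reduced to three fields for brevity.

```java
public class GroupKeySketch {

    // Builds a grouping key with an explicit single-space delimiter.
    public static String keyWithDelimiter(String source, String target, String relClass) {
        return String.join(" ", source, target, relClass);
    }

    // Without a delimiter argument, the first value is consumed as the delimiter.
    public static String keyWithoutDelimiter(String source, String target, String relClass) {
        return String.join(source, target, relClass);
    }

    public static void main(String[] args) {
        System.out.println(keyWithDelimiter("s1", "t1", "merges"));    // prints "s1 t1 merges"
        System.out.println(keyWithoutDelimiter("s1", "t1", "merges")); // prints "t1s1merges"
    }
}
```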
|
||||
|
||||
// redirect the relations to the dedupID
|
||||
private static Dataset<Relation> createNewRels(
|
||||
Dataset<Relation> rels, // all the relations to be redirected
|
||||
Dataset<Tuple2<String, String>> mergedIds, // merge rels: <mergedObjectID, dedupID>
|
||||
MapFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>, Relation> mapRel) {
|
||||
|
||||
// <sourceID, relation, targetID>
|
||||
Dataset<Tuple3<String, Relation, String>> mapped = rels
|
||||
.map(
|
||||
(MapFunction<Relation, Tuple3<String, Relation, String>>) r -> new Tuple3<>(getId(r, FieldType.SOURCE),
|
||||
r, getId(r, FieldType.TARGET)),
|
||||
Encoders.tuple(Encoders.STRING(), Encoders.kryo(Relation.class), Encoders.STRING()));
|
||||
|
||||
// < <sourceID, relation, target>, <sourceID, dedupID> >
|
||||
Dataset<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>> relSource = mapped
|
||||
.joinWith(mergedIds, mapped.col("_1").equalTo(mergedIds.col("_1")), "left_outer");
|
||||
|
||||
// < <<sourceID, relation, targetID>, <sourceID, dedupID>>, <targetID, dedupID> >
|
||||
Dataset<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>> relSourceTarget = relSource
|
||||
.joinWith(mergedIds, relSource.col("_1._3").equalTo(mergedIds.col("_1")), "left_outer");
|
||||
|
||||
return relSourceTarget
|
||||
.filter(
|
||||
(FilterFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>>) r -> r
|
||||
._1()
|
||||
._1() != null || r._2() != null)
|
||||
.map(mapRel, Encoders.bean(Relation.class))
|
||||
.distinct();
|
||||
}
|
||||
|
||||
private static Dataset<Relation> processDataset(
|
||||
Dataset<Relation> rels,
|
||||
Dataset<Tuple2<String, String>> mergedIds,
|
||||
FieldType type,
|
||||
MapFunction<Tuple2<Tuple2<String, Relation>, Tuple2<String, String>>, Relation> mapFn) {
|
||||
final Dataset<Tuple2<String, Relation>> mapped = rels
|
||||
.map(
|
||||
(MapFunction<Relation, Tuple2<String, Relation>>) r -> new Tuple2<>(getId(r, type), r),
|
||||
Encoders.tuple(Encoders.STRING(), Encoders.kryo(Relation.class)));
|
||||
return mapped
|
||||
.joinWith(mergedIds, mapped.col("_1").equalTo(mergedIds.col("_1")), "left_outer")
|
||||
.map(mapFn, Encoders.bean(Relation.class));
|
||||
}
|
||||
|
||||
private FilterFunction<Relation> getRelationFilterFunction() {
|
||||
|
@ -155,4 +167,52 @@ public class SparkPropagateRelation extends AbstractSparkAction {
|
|||
StringUtils.isNotBlank(r.getSubRelType()) ||
|
||||
StringUtils.isNotBlank(r.getRelClass());
|
||||
}
|
||||
|
||||
private static String getId(Relation r, FieldType type) {
|
||||
switch (type) {
|
||||
case SOURCE:
|
||||
return r.getSource();
|
||||
case TARGET:
|
||||
return r.getTarget();
|
||||
default:
|
||||
throw new IllegalArgumentException("");
|
||||
}
|
||||
}
|
||||
|
||||
private static MapFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>, Relation> getFixRelFn() {
|
||||
return value -> {
|
||||
|
||||
Relation r = value._1()._1()._2();
|
||||
String newSource = value._1()._2() != null ? value._1()._2()._2() : null;
|
||||
String newTarget = value._2() != null ? value._2()._2() : null;
|
||||
|
||||
if (r.getDataInfo() == null) {
|
||||
r.setDataInfo(new DataInfo());
|
||||
}
|
||||
r.getDataInfo().setDeletedbyinference(false);
|
||||
|
||||
if (newSource != null)
|
||||
r.setSource(newSource);
|
||||
|
||||
if (newTarget != null)
|
||||
r.setTarget(newTarget);
|
||||
|
||||
return r;
|
||||
};
|
||||
}
|
||||
|
||||
private static MapFunction<Tuple2<Tuple2<String, Relation>, Tuple2<String, String>>, Relation> getDeletedFn() {
|
||||
return value -> {
|
||||
if (value._2() != null) {
|
||||
Relation r = value._1()._2();
|
||||
if (r.getDataInfo() == null) {
|
||||
r.setDataInfo(new DataInfo());
|
||||
}
|
||||
r.getDataInfo().setDeletedbyinference(true);
|
||||
return r;
|
||||
}
|
||||
return value._1()._2();
|
||||
};
|
||||
}
|
||||
|
||||
}
|
||||
|
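The `distinctRelations` hunk above shows two variants of the grouping key, one calling `String.join(" ", ...)` and one calling `String.join(r.getSource(), ...)`. The following minimal sketch shows how the two calls differ, since `String.join` always treats its first argument as the delimiter. The class and method names are hypothetical and the sketch uses plain strings rather than `Relation` objects.

```java
public class JoinKeySketch {

    // Builds a grouping key with an explicit, fixed delimiter between the elements.
    static String keyWithDelimiter(String source, String target, String relClass) {
        return String.join(" ", source, target, relClass);
    }

    // Passes the source first: String.join uses it as the delimiter, not as an element.
    static String keyWithoutDelimiter(String source, String target, String relClass) {
        return String.join(source, target, relClass);
    }

    public static void main(String[] args) {
        System.out.println(keyWithDelimiter("s1", "t1", "merges"));    // s1 t1 merges
        System.out.println(keyWithoutDelimiter("s1", "t1", "merges")); // t1s1merges
    }
}
```

With the second form the source identifier ends up between the remaining elements instead of at the front of the key, so the two variants produce different grouping keys for the same relation.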
|
|
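The `createNewRels` and `getFixRelFn` code above redirects relation endpoints to their dedup identifiers through two left outer joins against `mergedIds`. The idea can be sketched without Spark, using a `Map` lookup in place of each join. This is an illustration under stated assumptions, not the project's implementation: `Rel` is a hypothetical stand-in for `Relation`, and the sketch keeps only relations with at least one merged endpoint, which is an assumption about the intent of the filter rather than a literal copy of it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RedirectSketch {

    // Hypothetical stand-in for Relation: only the fields used in this sketch.
    static final class Rel {
        String source;
        String target;
        boolean deletedByInference;

        Rel(String source, String target) {
            this.source = source;
            this.target = target;
        }
    }

    // mergedIds maps <mergedObjectID, dedupID>; a missing key plays the role of a left-outer-join miss.
    static List<Rel> redirect(List<Rel> rels, Map<String, String> mergedIds) {
        List<Rel> out = new ArrayList<>();
        for (Rel r : rels) {
            String newSource = mergedIds.get(r.source);
            String newTarget = mergedIds.get(r.target);
            if (newSource == null && newTarget == null) {
                continue; // neither endpoint was merged: nothing to redirect
            }
            Rel copy = new Rel(
                newSource != null ? newSource : r.source,
                newTarget != null ? newTarget : r.target);
            copy.deletedByInference = false;
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> mergedIds = new HashMap<>();
        mergedIds.put("pub::1", "dedup::A");
        mergedIds.put("pub::2", "dedup::A");

        List<Rel> rels = new ArrayList<>();
        rels.add(new Rel("pub::1", "proj::9"));
        rels.add(new Rel("org::5", "proj::9"));

        for (Rel r : redirect(rels, mergedIds)) {
            System.out.println(r.source + " -> " + r.target); // dedup::A -> proj::9
        }
    }
}
```

Unlike `getFixRelFn`, which mutates the joined relation in place, the sketch builds a copy so the input list is left unchanged.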
@ -0,0 +1,118 @@
|
|||
|
||||
package eu.dnetlib.dhp.oa.dedup;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.util.Optional;
|
||||
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
import org.apache.spark.SparkConf;
|
||||
import org.apache.spark.api.java.JavaSparkContext;
|
||||
import org.apache.spark.api.java.function.MapFunction;
|
||||
import org.apache.spark.sql.*;
|
||||
import org.dom4j.DocumentException;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
import org.xml.sax.SAXException;
|
||||
|
||||
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
|
||||
import eu.dnetlib.dhp.application.dedup.log.DedupLogModel;
|
||||
import eu.dnetlib.dhp.application.dedup.log.DedupLogWriter;
|
||||
import eu.dnetlib.dhp.schema.oaf.Relation;
|
||||
import eu.dnetlib.dhp.utils.ISLookupClientFactory;
|
||||
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
|
||||
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
|
||||
import eu.dnetlib.pace.config.DedupConfig;
|
||||
import eu.dnetlib.pace.model.SparkDedupConfig;
|
||||
|
||||
public class SparkSimRelsAnalytics extends AbstractSparkAction {
|
||||
|
||||
private static final Logger log = LoggerFactory.getLogger(SparkSimRelsAnalytics.class);
|
||||
|
||||
public SparkSimRelsAnalytics(ArgumentApplicationParser parser, SparkSession spark) {
|
||||
super(parser, spark);
|
||||
spark.sparkContext().setLogLevel("WARN");
|
||||
}
|
||||
|
||||
public static void main(String[] args) throws Exception {
|
||||
ArgumentApplicationParser parser = new ArgumentApplicationParser(
|
||||
IOUtils
|
||||
.toString(
|
||||
SparkSimRelsAnalytics.class
|
||||
.getResourceAsStream(
|
||||
"/eu/dnetlib/dhp/oa/dedup/createSimRels_parameters.json")));
|
||||
parser.parseArgument(args);
|
||||
|
||||
SparkConf conf = new SparkConf();
|
||||
new SparkSimRelsAnalytics(parser, getSparkSession(conf))
|
||||
.run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")));
|
||||
}
|
||||
|
||||
@Override
|
||||
public void run(ISLookUpService isLookUpService)
|
||||
throws DocumentException, IOException, ISLookUpException, SAXException {
|
||||
|
||||
// read oozie parameters
|
||||
final String graphBasePath = parser.get("graphBasePath");
|
||||
final String isLookUpUrl = parser.get("isLookUpUrl");
|
||||
final String actionSetId = parser.get("actionSetId");
|
||||
final String workingPath = parser.get("workingPath");
|
||||
final int numPartitions = Optional
|
||||
.ofNullable(parser.get("numPartitions"))
|
||||
.map(Integer::valueOf)
|
||||
.orElse(NUM_PARTITIONS);
|
||||
|
||||
log.info("numPartitions: '{}'", numPartitions);
|
||||
log.info("graphBasePath: '{}'", graphBasePath);
|
||||
log.info("isLookUpUrl: '{}'", isLookUpUrl);
|
||||
log.info("actionSetId: '{}'", actionSetId);
|
||||
log.info("workingPath: '{}'", workingPath);
|
||||
|
||||
final String dfLogPath = parser.get("dataframeLog");
|
||||
final String runTag = Optional.ofNullable(parser.get("runTAG")).orElse("UNKNOWN");
|
||||
|
||||
// for each dedup configuration
|
||||
for (DedupConfig dedupConf : getConfigurations(isLookUpService, actionSetId)) {
|
||||
final long start = System.currentTimeMillis();
|
||||
|
||||
final String entity = dedupConf.getWf().getEntityType();
|
||||
final String subEntity = dedupConf.getWf().getSubEntityValue();
|
||||
log.info("Creating simrels for: '{}'", subEntity);
|
||||
|
||||
final String outputPath = DedupUtility.createSimRelPath(workingPath, actionSetId, subEntity);
|
||||
removeOutputDir(spark, outputPath);
|
||||
|
||||
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
|
||||
|
||||
SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);
|
||||
|
||||
spark.udf().register("collect_sort_slice", sparkConfig.collectSortSliceUDAF());
|
||||
|
||||
Dataset<?> simRels = spark
|
||||
.read()
|
||||
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
|
||||
.transform(sparkConfig.modelExtractor()) // Extract fields from input json column according to model
|
||||
// definition
|
||||
.transform(sparkConfig.generateClustersWithWindows()) // generate <key,block> pairs according to
|
||||
// filters, clusters, and model
|
||||
// definition
|
||||
.transform(sparkConfig.processClusters()) // process blocks and emits <from,to> pairs of found
|
||||
// similarities
|
||||
.map(
|
||||
(MapFunction<Row, Relation>) t -> DedupUtility
|
||||
.createSimRel(t.getStruct(0).getString(0), t.getStruct(0).getString(1), entity),
|
||||
Encoders.bean(Relation.class));
|
||||
|
||||
saveParquet(simRels, outputPath, SaveMode.Overwrite);
|
||||
final long end = System.currentTimeMillis();
|
||||
if (StringUtils.isNotBlank(dfLogPath)) {
|
||||
final DedupLogModel model = new DedupLogModel(runTag, dedupConf.toString(), subEntity, start, end,
|
||||
end - start);
|
||||
new DedupLogWriter(dfLogPath).appendLog(model, spark);
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
}
|
|
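`SparkSimRelsAnalytics.run` reads `numPartitions` and `runTAG` as optional parameters and falls back to a default when they are absent. A minimal sketch of that pattern follows, assuming a plain `Map` in place of `ArgumentApplicationParser` and a hypothetical default value (the real `NUM_PARTITIONS` constant is not shown in this diff).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class OptionalParamSketch {

    // Hypothetical default; the project's NUM_PARTITIONS constant is defined elsewhere.
    static final int DEFAULT_NUM_PARTITIONS = 1000;

    // Returns the parsed value when the parameter is present, otherwise the default.
    static int numPartitions(Map<String, String> params) {
        return Optional
            .ofNullable(params.get("numPartitions"))
            .map(Integer::valueOf)
            .orElse(DEFAULT_NUM_PARTITIONS);
    }

    // Returns the run tag, or "UNKNOWN" when none was supplied.
    static String runTag(Map<String, String> params) {
        return Optional.ofNullable(params.get("runTAG")).orElse("UNKNOWN");
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        System.out.println(numPartitions(params)); // 1000
        System.out.println(runTag(params));        // UNKNOWN

        params.put("numPartitions", "5000");
        System.out.println(numPartitions(params)); // 5000
    }
}
```

Note that `Integer::valueOf` still throws `NumberFormatException` for a present but non-numeric value; only a missing parameter falls back to the default.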
@ -104,6 +104,18 @@ public class SparkWhitelistSimRels extends AbstractSparkAction {
|
|||
.join(entities, whiteListRels1.col("to").equalTo(entities.col("id")), "inner")
|
||||
.select("from", "to");
|
||||
|
||||
// Dataset<Tuple2<String, String>> whiteListRels1 = whiteListRels
|
||||
// .joinWith(entities, whiteListRels.col("_1").equalTo(entities.col("_1")), "inner")
|
||||
// .map(
|
||||
// (MapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, String>>, Tuple2<String, String>>) Tuple2::_1,
|
||||
// Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
|
||||
//
|
||||
// Dataset<Tuple2<String, String>> whiteListRels2 = whiteListRels1
|
||||
// .joinWith(entities, whiteListRels1.col("_2").equalTo(entities.col("_1")), "inner")
|
||||
// .map(
|
||||
// (MapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, String>>, Tuple2<String, String>>) Tuple2::_1,
|
||||
// Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
|
||||
|
||||
Dataset<Relation> whiteListSimRels = whiteListRels2
|
||||
.map(
|
||||
(MapFunction<Row, Relation>) r -> DedupUtility
|
||||
|
|
|
@ -0,0 +1,15 @@
|
|||
package eu.dnetlib.dhp.oa.dedup.dsl
|
||||
|
||||
case class Clustering(name: String = "",
|
||||
fields: Seq[String] = Seq(),
|
||||
params: Map[String,Int] = Map()) {
|
||||
def withName(name: String) : Clustering =
|
||||
copy(name = name)
|
||||
|
||||
def withFields(fields: String*): Clustering =
|
||||
copy(fields = fields)
|
||||
|
||||
def withParams(params: Map[String,Int]): Clustering =
|
||||
copy(params = params)
|
||||
}
|
||||
|
|
@ -0,0 +1,11 @@
|
|||
package eu.dnetlib.dhp.oa.dedup.dsl
|
||||
|
||||
import org.apache.spark.sql.types.StructType
|
||||
|
||||
case class Deduper (schema: StructType,
|
||||
clusterings: Seq[Clustering] = Seq()) {
|
||||
|
||||
def withClustering(clusterings: Clustering*) =
|
||||
copy(clusterings = clusterings)
|
||||
|
||||
}
|
|
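The `Clustering` and `Deduper` case classes above expose `withX` methods that return a modified copy through `copy(...)`, so a configuration is built by chaining calls without mutating anything. A rough Java analogue of `Clustering` is sketched below for readers less familiar with Scala case classes. The class name, the clustering name `"ngrams"`, the field name `"legalname"`, and the parameter `"ngramLen"` are hypothetical and are not taken from the diff.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class ClusteringSketch {

    final String name;
    final List<String> fields;
    final Map<String, Integer> params;

    // Mirrors the Scala defaults: empty name, no fields, no params.
    ClusteringSketch() {
        this("", Collections.emptyList(), Collections.emptyMap());
    }

    private ClusteringSketch(String name, List<String> fields, Map<String, Integer> params) {
        this.name = name;
        this.fields = fields;
        this.params = params;
    }

    // Each withX method returns a new instance and leaves the receiver untouched.
    ClusteringSketch withName(String name) {
        return new ClusteringSketch(name, fields, params);
    }

    ClusteringSketch withFields(String... fields) {
        return new ClusteringSketch(name, Arrays.asList(fields), params);
    }

    ClusteringSketch withParams(Map<String, Integer> params) {
        return new ClusteringSketch(name, fields, new HashMap<>(params));
    }

    public static void main(String[] args) {
        ClusteringSketch base = new ClusteringSketch();
        ClusteringSketch configured = base
            .withName("ngrams")
            .withFields("legalname")
            .withParams(Collections.singletonMap("ngramLen", 3));

        System.out.println(configured.name + " " + configured.fields + " " + configured.params);
        System.out.println(base.name.isEmpty()); // true: the original instance is unchanged
    }
}
```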
@ -1,20 +0,0 @@
|
|||
[
|
||||
{
|
||||
"paramName": "i",
|
||||
"paramLongName": "graphBasePath",
|
||||
"paramDescription": "the base path of raw graph",
|
||||
"paramRequired": true
|
||||
},
|
||||
{
|
||||
"paramName": "w",
|
||||
"paramLongName": "inputPath",
|
||||
"paramDescription": "the path to the input relation to cleanup",
|
||||
"paramRequired": true
|
||||
},
|
||||
{
|
||||
"paramName": "o",
|
||||
"paramLongName": "outputPath",
|
||||
"paramDescription": "the path of the output relation cleaned",
|
||||
"paramRequired": true
|
||||
}
|
||||
]
|
|
@ -15,8 +15,4 @@
|
|||
<name>oozie.action.sharelib.for.spark</name>
|
||||
<value>spark2</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>sparkExecutorMemoryOverhead</name>
|
||||
<value>1G</value>
|
||||
</property>
|
||||
</configuration>
|
|
@ -12,26 +12,19 @@
|
|||
<name>graphOutputPath</name>
|
||||
<description>path of the output graph</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>filterInvisible</name>
|
||||
<description>whether filter out invisible entities after merge</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>sparkDriverMemory</name>
|
||||
<description>heap memory for driver process</description>
|
||||
<description>memory for driver process</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>sparkExecutorMemory</name>
|
||||
<description>heap memory for individual executor</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>sparkExecutorMemoryOverhead</name>
|
||||
<description>off-heap memory for individual executor</description>
|
||||
<description>memory for individual executor</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>sparkExecutorCores</name>
|
||||
<description>number of cores used by single executor</description>
|
||||
</property>
|
||||
|
||||
<property>
|
||||
<name>oozieActionShareLibForSpark2</name>
|
||||
<description>oozie action sharelib for spark 2.*</description>
|
||||
|
@ -90,7 +83,6 @@
|
|||
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
|
||||
<spark-opts>
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
|
||||
--executor-cores=${sparkExecutorCores}
|
||||
--driver-memory=${sparkDriverMemory}
|
||||
--conf spark.extraListeners=${spark2ExtraListeners}
|
||||
|
@ -100,35 +92,9 @@
|
|||
--conf spark.sql.shuffle.partitions=15000
|
||||
</spark-opts>
|
||||
<arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
|
||||
<arg>--graphOutputPath</arg><arg>${workingPath}/propagaterelation/</arg>
|
||||
<arg>--o</arg><arg>${graphOutputPath}</arg>
|
||||
<arg>--workingPath</arg><arg>${workingPath}</arg>
|
||||
</spark>
|
||||
<ok to="CleanRelation"/>
|
||||
<error to="Kill"/>
|
||||
</action>
|
||||
|
||||
<action name="CleanRelation">
|
||||
<spark xmlns="uri:oozie:spark-action:0.2">
|
||||
<master>yarn</master>
|
||||
<mode>cluster</mode>
|
||||
<name>Clean Relations</name>
|
||||
<class>eu.dnetlib.dhp.oa.dedup.SparkCleanRelation</class>
|
||||
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
|
||||
<spark-opts>
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
|
||||
--executor-cores=${sparkExecutorCores}
|
||||
--driver-memory=${sparkDriverMemory}
|
||||
--conf spark.extraListeners=${spark2ExtraListeners}
|
||||
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
|
||||
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
|
||||
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
|
||||
--conf spark.sql.shuffle.partitions=15000
|
||||
</spark-opts>
|
||||
<arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
|
||||
<arg>--inputPath</arg><arg>${workingPath}/propagaterelation/relation</arg>
|
||||
<arg>--outputPath</arg><arg>${graphOutputPath}/relation</arg>
|
||||
</spark>
|
||||
<ok to="group_entities"/>
|
||||
<error to="Kill"/>
|
||||
</action>
|
||||
|
@ -141,9 +107,8 @@
|
|||
<class>eu.dnetlib.dhp.oa.merge.GroupEntitiesSparkJob</class>
|
||||
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
|
||||
<spark-opts>
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
|
||||
--executor-cores=${sparkExecutorCores}
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--driver-memory=${sparkDriverMemory}
|
||||
--conf spark.extraListeners=${spark2ExtraListeners}
|
||||
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
|
||||
|
@ -154,21 +119,30 @@
|
|||
<arg>--graphInputPath</arg><arg>${graphBasePath}</arg>
|
||||
<arg>--outputPath</arg><arg>${workingPath}/grouped_entities</arg>
|
||||
</spark>
|
||||
<ok to="dispatch_entities"/>
|
||||
<ok to="fork_dispatch_entities"/>
|
||||
<error to="Kill"/>
|
||||
</action>
|
||||
|
||||
<action name="dispatch_entities">
|
||||
<fork name="fork_dispatch_entities">
|
||||
<path start="dispatch_datasource"/>
|
||||
<path start="dispatch_project"/>
|
||||
<path start="dispatch_organization"/>
|
||||
<path start="dispatch_publication"/>
|
||||
<path start="dispatch_dataset"/>
|
||||
<path start="dispatch_software"/>
|
||||
<path start="dispatch_otherresearchproduct"/>
|
||||
</fork>
|
||||
|
||||
<action name="dispatch_datasource">
|
||||
<spark xmlns="uri:oozie:spark-action:0.2">
|
||||
<master>yarn</master>
|
||||
<mode>cluster</mode>
|
||||
<name>Dispatch grouped entitities</name>
|
||||
<name>Dispatch publications</name>
|
||||
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
|
||||
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
|
||||
<spark-opts>
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
|
||||
--executor-cores=${sparkExecutorCores}
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--driver-memory=${sparkDriverMemory}
|
||||
--conf spark.extraListeners=${spark2ExtraListeners}
|
||||
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
|
||||
|
@ -177,12 +151,164 @@
|
|||
--conf spark.sql.shuffle.partitions=7680
|
||||
</spark-opts>
|
||||
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
|
||||
<arg>--outputPath</arg><arg>${graphOutputPath}</arg>
|
||||
<arg>--filterInvisible</arg><arg>${filterInvisible}</arg>
|
||||
<arg>--outputPath</arg><arg>${graphOutputPath}/datasource</arg>
|
||||
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Datasource</arg>
|
||||
</spark>
|
||||
<ok to="End"/>
|
||||
<ok to="wait_dispatch"/>
|
||||
<error to="Kill"/>
|
||||
</action>
|
||||
|
||||
<action name="dispatch_project">
|
||||
<spark xmlns="uri:oozie:spark-action:0.2">
|
||||
<master>yarn</master>
|
||||
<mode>cluster</mode>
|
||||
<name>Dispatch project</name>
|
||||
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
|
||||
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
|
||||
<spark-opts>
|
||||
--executor-cores=${sparkExecutorCores}
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--driver-memory=${sparkDriverMemory}
|
||||
--conf spark.extraListeners=${spark2ExtraListeners}
|
||||
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
|
||||
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
|
||||
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
|
||||
--conf spark.sql.shuffle.partitions=7680
|
||||
</spark-opts>
|
||||
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
|
||||
<arg>--outputPath</arg><arg>${graphOutputPath}/project</arg>
|
||||
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Project</arg>
|
||||
</spark>
|
||||
<ok to="wait_dispatch"/>
|
||||
<error to="Kill"/>
|
||||
</action>
|
||||
|
||||
<action name="dispatch_organization">
|
||||
<spark xmlns="uri:oozie:spark-action:0.2">
|
||||
<master>yarn</master>
|
||||
<mode>cluster</mode>
|
||||
<name>Dispatch organization</name>
|
||||
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
|
||||
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
|
||||
<spark-opts>
|
||||
--executor-cores=${sparkExecutorCores}
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--driver-memory=${sparkDriverMemory}
|
||||
--conf spark.extraListeners=${spark2ExtraListeners}
|
||||
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
|
||||
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
|
||||
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
|
||||
--conf spark.sql.shuffle.partitions=7680
|
||||
</spark-opts>
|
||||
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
|
||||
<arg>--outputPath</arg><arg>${graphOutputPath}/organization</arg>
|
||||
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Organization</arg>
|
||||
</spark>
|
||||
<ok to="wait_dispatch"/>
|
||||
<error to="Kill"/>
|
||||
</action>
|
||||
|
||||
<action name="dispatch_publication">
|
||||
<spark xmlns="uri:oozie:spark-action:0.2">
|
||||
<master>yarn</master>
|
||||
<mode>cluster</mode>
|
||||
<name>Dispatch publication</name>
|
||||
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
|
||||
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
|
||||
<spark-opts>
|
||||
--executor-cores=${sparkExecutorCores}
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--driver-memory=${sparkDriverMemory}
|
||||
--conf spark.extraListeners=${spark2ExtraListeners}
|
||||
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
|
||||
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
|
||||
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
|
||||
--conf spark.sql.shuffle.partitions=7680
|
||||
</spark-opts>
|
||||
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
|
||||
<arg>--outputPath</arg><arg>${graphOutputPath}/publication</arg>
|
||||
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>
|
||||
</spark>
|
||||
<ok to="wait_dispatch"/>
|
||||
<error to="Kill"/>
|
||||
</action>
|
||||
|
||||
<action name="dispatch_dataset">
|
||||
<spark xmlns="uri:oozie:spark-action:0.2">
|
||||
<master>yarn</master>
|
||||
<mode>cluster</mode>
|
||||
<name>Dispatch dataset</name>
|
||||
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
|
||||
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
|
||||
<spark-opts>
|
||||
--executor-cores=${sparkExecutorCores}
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--driver-memory=${sparkDriverMemory}
|
||||
--conf spark.extraListeners=${spark2ExtraListeners}
|
||||
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
|
||||
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
|
||||
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
|
||||
--conf spark.sql.shuffle.partitions=7680
|
||||
</spark-opts>
|
||||
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
|
||||
<arg>--outputPath</arg><arg>${graphOutputPath}/dataset</arg>
|
||||
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Dataset</arg>
|
||||
</spark>
|
||||
<ok to="wait_dispatch"/>
|
||||
<error to="Kill"/>
|
||||
</action>
|
||||
|
||||
<action name="dispatch_software">
|
||||
<spark xmlns="uri:oozie:spark-action:0.2">
|
||||
<master>yarn</master>
|
||||
<mode>cluster</mode>
|
||||
<name>Dispatch software</name>
|
||||
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
|
||||
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
|
||||
<spark-opts>
|
||||
--executor-cores=${sparkExecutorCores}
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--driver-memory=${sparkDriverMemory}
|
||||
--conf spark.extraListeners=${spark2ExtraListeners}
|
||||
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
|
||||
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
|
||||
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
|
||||
--conf spark.sql.shuffle.partitions=7680
|
||||
</spark-opts>
|
||||
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
|
||||
<arg>--outputPath</arg><arg>${graphOutputPath}/software</arg>
|
||||
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Software</arg>
|
||||
</spark>
|
||||
<ok to="wait_dispatch"/>
|
||||
<error to="Kill"/>
|
||||
</action>
|
||||
|
||||
<action name="dispatch_otherresearchproduct">
|
||||
<spark xmlns="uri:oozie:spark-action:0.2">
|
||||
<master>yarn</master>
|
||||
<mode>cluster</mode>
|
||||
<name>Dispatch otherresearchproduct</name>
|
||||
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
|
||||
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
|
||||
<spark-opts>
|
||||
--executor-cores=${sparkExecutorCores}
|
||||
--executor-memory=${sparkExecutorMemory}
|
||||
--driver-memory=${sparkDriverMemory}
|
||||
--conf spark.extraListeners=${spark2ExtraListeners}
|
||||
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
|
||||
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
|
||||
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
|
||||
--conf spark.sql.shuffle.partitions=7680
|
||||
</spark-opts>
|
||||
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
|
||||
<arg>--outputPath</arg><arg>${graphOutputPath}/otherresearchproduct</arg>
|
||||
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.OtherResearchProduct</arg>
|
||||
</spark>
|
||||
<ok to="wait_dispatch"/>
|
||||
<error to="Kill"/>
|
||||
</action>
|
||||
|
||||
<join name="wait_dispatch" to="End"/>
|
||||
|
||||
<end name="End"/>
|
||||
</workflow-app>
|
|
@ -13,6 +13,10 @@
|
|||
</property>
|
||||
<property>
|
||||
<name>oozie.action.sharelib.for.spark</name>
|
||||
<value>spark2</value>
|
||||
<value>spark342</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>oozie.launcher.mapreduce.user.classpath.first</name>
|
||||
<value>true</value>
|
||||
</property>
|
||||
</configuration>
|
|
@ -126,15 +126,25 @@
|
|||
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
|
||||
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
|
||||
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
|
||||
--conf spark.sql.shuffle.partitions=15000
|
||||
--conf spark.sql.shuffle.partitions=5000
|
||||
--conf spark.driver.extraJavaOptions="-Xss256k"
|
||||
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=spark-log4j.properties -Xss256k"
|
||||
--conf spark.extraListeners=
|
||||
--conf spark.sql.queryExecutionListeners=
|
||||
--conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecutors=100 --conf spark.dynamicAllocation.shuffleTracking.enabled=true
|
||||
--conf spark.network.io.preferDirectBufs=true --conf spark.memory.fraction=0.4 --conf spark.sql.adaptive.coalescePartitions.minPartitionNum=5000
|
||||
--conf spark.shuffle.useOldFetchProtocol=true --conf spark.shuffle.service.enabled=true --conf spark.eventLog.enabled=true
|
||||
--conf spark.executor.heartbeatInterval=60s
|
||||
--conf spark.network.timeout=640s
|
||||
--conf spark.sql.legacy.allowUntypedScalaUDF=true
|
||||
</spark-opts>
|
||||
<arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
|
||||
<arg>--isLookUpUrl</arg><arg>${isLookUpUrl}</arg>
|
||||
<arg>--actionSetId</arg><arg>${actionSetId}</arg>
|
||||
<arg>--workingPath</arg><arg>${workingPath}</arg>
|
||||
<arg>--numPartitions</arg><arg>15000</arg>
|
||||
<arg>--numPartitions</arg><arg>5000</arg>
|
||||
</spark>
|
||||
<ok to="WhitelistSimRels"/>
|
||||
<ok to="End"/>
|
||||
<error to="Kill"/>
|
||||
</action>
|
||||
|
||||
|
|
|
@ -9,8 +9,7 @@ import java.io.IOException;
|
|||
import java.io.Serializable;
|
||||
import java.lang.reflect.InvocationTargetException;
|
||||
import java.nio.file.Paths;
|
||||
import java.util.ArrayList;
|
||||
import java.util.List;
|
||||
import java.util.*;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
import org.codehaus.jackson.map.ObjectMapper;
|
||||
|
@ -18,10 +17,7 @@ import org.junit.jupiter.api.BeforeEach;
|
|||
import org.junit.jupiter.api.Test;
|
||||
|
||||
import eu.dnetlib.dhp.oa.merge.AuthorMerger;
|
||||
import eu.dnetlib.dhp.schema.oaf.DataInfo;
|
||||
import eu.dnetlib.dhp.schema.oaf.Publication;
|
||||
import eu.dnetlib.dhp.schema.oaf.Software;
|
||||
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
|
||||
import eu.dnetlib.dhp.schema.oaf.*;
|
||||
import eu.dnetlib.pace.util.MapDocumentUtil;
|
||||
import scala.Tuple2;
|
||||
|
||||
|
|
|
@ -0,0 +1,125 @@
|
|||
|
||||
package eu.dnetlib.dhp.oa.dedup;
|
||||
|
||||
import static java.nio.file.Files.createTempDirectory;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
import static org.mockito.Mockito.lenient;
|
||||
|
||||
import java.io.File;
|
||||
import java.io.IOException;
|
||||
import java.io.Serializable;
|
||||
import java.net.URISyntaxException;
|
||||
import java.nio.file.Paths;
|
||||
|
||||
import org.apache.commons.io.FileUtils;
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.spark.SparkConf;
|
||||
import org.apache.spark.api.java.JavaSparkContext;
|
||||
import org.apache.spark.sql.SparkSession;
|
||||
import org.junit.jupiter.api.AfterAll;
|
||||
import org.junit.jupiter.api.BeforeAll;
|
||||
import org.junit.jupiter.api.BeforeEach;
|
||||
import org.junit.jupiter.api.Test;
|
||||
import org.junit.jupiter.api.extension.ExtendWith;
|
||||
import org.mockito.Mock;
|
||||
import org.mockito.Mockito;
|
||||
import org.mockito.junit.jupiter.MockitoExtension;
|
||||
|
||||
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
|
||||
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
|
||||
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
|
||||
|
||||
@ExtendWith(MockitoExtension.class)
|
||||
public class SparkDSLExampleTest implements Serializable {
|
||||
|
||||
@Mock(serializable = true)
|
||||
ISLookUpService isLookUpService;
|
||||
|
||||
private static SparkSession spark;
|
||||
private static JavaSparkContext jsc;
|
||||
|
||||
private static String testGraphBasePath;
|
||||
private static String testOutputBasePath;
|
||||
private static final String testActionSetId = "test-orchestrator";
|
||||
|
||||
@BeforeAll
|
||||
public static void beforeAll() throws IOException, URISyntaxException {
|
||||
|
||||
testGraphBasePath = Paths
|
||||
.get(SparkDedupTest.class.getResource("/eu/dnetlib/dhp/dedup/entities").toURI())
|
||||
.toFile()
|
||||
.getAbsolutePath();
|
||||
testOutputBasePath = createTempDirectory(SparkDedupTest.class.getSimpleName() + "-")
|
||||
.toAbsolutePath()
|
||||
.toString();
|
||||
|
||||
FileUtils.deleteDirectory(new File(testOutputBasePath));
|
||||
|
||||
final SparkConf conf = new SparkConf();
|
||||
conf.set("spark.sql.shuffle.partitions", "200");
|
||||
spark = SparkSession
|
||||
.builder()
|
||||
.appName(SparkDedupTest.class.getSimpleName())
|
||||
.master("local[*]")
|
||||
.config(conf)
|
||||
.getOrCreate();
|
||||
|
||||
jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
|
||||
}
|
||||
|
||||
@BeforeEach
|
||||
public void setUp() throws IOException, ISLookUpException {
|
||||
|
||||
lenient()
|
||||
.when(isLookUpService.getResourceProfileByQuery(Mockito.contains(testActionSetId)))
|
||||
.thenReturn(
|
||||
IOUtils
|
||||
.toString(
|
||||
SparkDSLExampleTest.class
|
||||
.getResourceAsStream(
|
||||
"/eu/dnetlib/dhp/dedup/profiles/mock_orchestrator.xml")));
|
||||
|
||||
lenient()
|
||||
.when(isLookUpService.getResourceProfileByQuery(Mockito.contains("organization")))
|
||||
.thenReturn(
|
||||
IOUtils
|
||||
.toString(
|
||||
SparkDSLExampleTest.class
|
||||
.getResourceAsStream(
|
||||
"/eu/dnetlib/dhp/dedup/conf/org.curr.conf.json")));
|
||||
}
|
||||
|
||||
@Test
|
||||
void createBlockStatsTest() throws Exception {
|
||||
|
||||
ArgumentApplicationParser parser = new ArgumentApplicationParser(
|
||||
IOUtils
|
||||
.toString(
|
||||
SparkDSLExampleTest.class
|
||||
.getResourceAsStream(
|
||||
"/eu/dnetlib/dhp/oa/dedup/createBlockStats_parameters.json")));
|
||||
parser
|
||||
.parseArgument(
|
||||
new String[] {
|
||||
"-i", testGraphBasePath,
|
||||
"-asi", testActionSetId,
|
||||
"-la", "lookupurl",
|
||||
"-w", testOutputBasePath
|
||||
});
|
||||
|
||||
new DSLExample(parser, spark).run(isLookUpService);
|
||||
|
||||
long orgs_blocks = spark
|
||||
.read()
|
||||
.textFile(testOutputBasePath + "/" + testActionSetId + "/organization_blockstats")
|
||||
.count();
|
||||
|
||||
assertEquals(480, orgs_blocks);
|
||||
}
|
||||
|
||||
@AfterAll
|
||||
public static void tearDown() {
|
||||
spark.close();
|
||||
}
|
||||
}
|
Some files were not shown because too many files have changed in this diff.