orcid-no-doi #43

Merged
claudio.atzori merged 45 commits from enrico.ottonello/dnet-hadoop:orcid-no-doi into master 2020-12-02 10:55:12 +01:00
Contributor

ORCID publications (no DOI) dataset generation from dump files

claudio.atzori requested changes 2020-09-21 16:25:25 +02:00
claudio.atzori left a comment
Owner

In this review I checked basic code practices like

  • exception propagation
  • management of side cases (methods returning null)

Please indicate the rationale behind your choices and apply the requested changes.

@ -87,0 +87,4 @@
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.8</version>

Versions of dependencies should be declared only in the main pom file. Please declare this dependency there (v1.8) and refer to it here without overriding the version.

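A sketch of the version-less declaration that would remain in the module pom (coordinates taken from the hunk above):

```xml
<!-- dhp-workflows/dhp-doiboost/pom.xml: no <version> element; it is
     inherited from the dependencyManagement section of the main pom -->
<dependency>
	<groupId>org.apache.commons</groupId>
	<artifactId>commons-text</artifactId>
</dependency>
```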
@ -56,3 +55,4 @@
fs = FileSystem.get(URI.create(hdfsServerUri.concat(workingPath)), conf);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();

Let the exception propagate and break the job

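A sketch of the fail-fast variant (the enclosing method is hypothetical; the statement itself comes from the hunk above):

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// sketch: no try/catch here; the checked exception is declared instead,
// so a failure to reach HDFS breaks the job immediately
private static FileSystem initFileSystem(final String hdfsServerUri,
	final String workingPath, final Configuration conf) throws IOException {
	return FileSystem.get(URI.create(hdfsServerUri.concat(workingPath)), conf);
}
```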
@ -0,0 +128,4 @@
}
}
} catch (Exception e) {
Log

What is the reason for neither handling this exception nor letting it propagate? I imagine that a malformed entry in the tar file could cause it, but in that case we should interrupt the procedure and deepen the analysis to spot the error. As it stands, the error would likely go unnoticed while causing a drop in the number of output records.
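One way to make the failure visible (a sketch assuming the loop reads entries through commons-compress; `tarInputStream` and the `processEntry` helper are illustrative names, not the code under review):

```java
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

// sketch: rethrow instead of only logging, so a malformed tar entry
// interrupts the procedure and can be analysed
TarArchiveEntry entry;
while ((entry = tarInputStream.getNextTarEntry()) != null) {
	try {
		processEntry(entry, tarInputStream);
	} catch (Exception e) {
		throw new RuntimeException(
			"error processing tar entry " + entry.getName(), e);
	}
}
```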
@ -0,0 +167,4 @@
return name.getAsString();
}
}
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
@ -0,0 +98,4 @@
"/eu/dnetlib/dhp/doiboost/orcidnodoi/mappings/typologies.json"));
typologiesMapping = new Gson().fromJson(tt, Map.class);
} catch (final Exception e) {
logger.error("loading typologies", e);

This should not happen as this is statically defined, but please let the exception propagate with some subclass of `Throwable` so that it breaks immediately. Otherwise the `typologiesMapping` variable will stay `null`, causing the first usage to break with a NPE.
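A fail-fast sketch for the static initialization (the enclosing class name `PublicationToOaf` is assumed from the thread; imports shown for completeness):

```java
import java.util.Map;

import org.apache.commons.io.IOUtils;

import com.google.gson.Gson;

// sketch: rethrow as an Error so the class fails to initialize immediately,
// instead of leaving typologiesMapping null for a later NPE
static {
	try {
		final String tt = IOUtils
			.toString(
				PublicationToOaf.class
					.getResourceAsStream(
						"/eu/dnetlib/dhp/doiboost/orcidnodoi/mappings/typologies.json"));
		typologiesMapping = new Gson().fromJson(tt, Map.class);
	} catch (final Exception e) {
		throw new ExceptionInInitializerError(e);
	}
}
```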
@ -0,0 +117,4 @@
if (errorsGeneric != null) {
errorsGeneric.add(1);
}
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is a filter for the null value:

```java
JavaRDD<Publication> oafPublicationRDD = enrichedWorksRDD
	.map(
		e -> {
			return (Publication) publicationToOaf
				.generatePublicationActionsFromJson(e._2());
		})
	.filter(p -> p != null);
```
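For reference, the same guard can be written with a method reference (a minimal equivalent sketch, not the code under review):

```java
import java.util.Objects;

// equivalent null filter using Objects::nonNull
JavaRDD<Publication> oafPublicationRDD = enrichedWorksRDD
	.map(e -> (Publication) publicationToOaf.generatePublicationActionsFromJson(e._2()))
	.filter(Objects::nonNull);
```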
@ -0,0 +124,4 @@
public Oaf generatePublicationActionsFromDump(final JsonObject rootElement) {
if (!isValid(rootElement)) {
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is a filter for the null value: the same `.filter(p -> p != null)` on `oafPublicationRDD` shown above.
@ -0,0 +175,4 @@
if (errorsInvalidTitle != null) {
errorsInvalidTitle.add(1);
}
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is a check on the null value
@ -0,0 +245,4 @@
if (errorsInvalidType != null) {
errorsInvalidType.add(1);
}
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is a filter on the null value: the same `.filter(p -> p != null)` on `oafPublicationRDD` shown above.
@ -0,0 +256,4 @@
if (errorsNotFoundAuthors != null) {
errorsNotFoundAuthors.add(1);
}
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is a filter on the null value: the same `.filter(p -> p != null)` on `oafPublicationRDD` shown above.
@ -0,0 +328,4 @@
return authors;
}
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is a check on the null value
@ -0,0 +333,4 @@
private List<String> createRepeatedField(final JsonObject rootElement, final String fieldName) {
if (!rootElement.has(fieldName)) {
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is a check on the null value
@ -0,0 +336,4 @@
return null;
}
if (rootElement.has(fieldName) && rootElement.get(fieldName).isJsonNull()) {
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is a check on the null value
@ -0,0 +340,4 @@
}
if (rootElement.get(fieldName).isJsonArray()) {
if (!isValidJsonArray(rootElement, fieldName)) {
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is a check on the null value
@ -0,0 +387,4 @@
try {
pubDateJson = rootElement.getAsJsonObject(jsonKey);
} catch (Exception e) {
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is this check on the value: `StringUtils.isNotBlank`
@ -0,0 +390,4 @@
return null;
}
if (pubDateJson == null) {
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is this check on the value: `StringUtils.isNotBlank`
@ -0,0 +397,4 @@
final String day = getStringValue(pubDateJson, "day");
if (StringUtils.isBlank(year)) {
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is this check on the value: `StringUtils.isNotBlank`
@ -0,0 +413,4 @@
if (isValidDate(pubDate)) {
return pubDate;
}
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

yes, there is this check on the value: `StringUtils.isNotBlank`
@ -0,0 +475,4 @@
private StructuredProperty mapStructuredProperty(String value, Qualifier qualifier, DataInfo dataInfo) {
if (value == null | StringUtils.isBlank(value)) {
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
@ -0,0 +487,4 @@
private Field<String> mapStringField(String value, DataInfo dataInfo) {
if (value == null || StringUtils.isBlank(value)) {
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

the value returned from this function is used to populate two fields of the publication: `publication.setSource(value)` and `publication.setDateofacceptance(value)`.
@ -0,0 +20,4 @@
public static String getStringValue(final JsonObject root, final String key) {
if (root.has(key) && !root.get(key).isJsonNull())
return root.get(key).getAsString();
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

replaced the null value with a safer empty string
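A sketch of the described fix applied to the helper quoted above (the enclosing class name is illustrative):

```java
import com.google.gson.JsonObject;

public class JsonHelper {

	// sketch: empty string as a safer fallback when the key is absent or null
	public static String getStringValue(final JsonObject root, final String key) {
		if (root.has(key) && !root.get(key).isJsonNull()) {
			return root.get(key).getAsString();
		}
		return "";
	}
}
```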
@ -0,0 +69,4 @@
final String id = (workNodes.get(0).getAttributes().get("put-code"));
workData.setId(id);
} else {
return null;

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Author
Contributor

in this case a `null` value indicates that the original publication in XML format was not parsed, because mandatory information was not found; there is a check on this `null` value
@ -0,0 +29,4 @@
import eu.dnetlib.doiboost.orcidnodoi.similarity.AuthorMatcher;
import jdk.nashorn.internal.ir.annotations.Ignore;
public class OrcidNoDoiTest {

Why are the tests in this class marked as `@Ignore`? I see you spent some effort implementing them; I didn't dive much into the details, but I see assertions properly defined, so if they are valuable I suggest keeping the tests active.
claudio.atzori reviewed 2020-09-22 11:10:54 +02:00

Please remove the dependency version from here altogether. The version has to be declared *only* in the main pom file, just like all the other dependencies in the same pom file.
claudio.atzori requested changes 2020-09-22 12:34:57 +02:00
claudio.atzori left a comment
Owner

Please remove the dependency version altogether from `dhp-workflows/dhp-doiboost/pom.xml`. All dependency versions have to be declared only in the main pom file.
Author
Contributor

All the requested modifications are done for now. I updated the fork with the current master.
claudio.atzori reviewed 2020-11-03 17:02:04 +01:00

I still see the dependency version declared here. Please move it into the project's main pom under the _dependencyManagement_ section.
claudio.atzori requested changes 2020-11-03 17:06:43 +01:00
claudio.atzori left a comment
Owner

Please revise the dependency section in the pom file. The version for the dependency `org.apache.commons:commons-text` must be moved into the project's main pom file.

More specifically, I refer to this dependency: https://code-repo.d4science.org/enrico.ottonello/dnet-hadoop/src/branch/orcid-no-doi/dhp-workflows/dhp-doiboost/pom.xml#L90
Author
Contributor

I had already moved that dependency into the main pom.xml file, as you can see here https://code-repo.d4science.org/enrico.ottonello/dnet-hadoop/src/branch/orcid-no-doi/pom.xml#L689

> I had already moved that dependency into the main pom.xml file, as you can see here https://code-repo.d4science.org/enrico.ottonello/dnet-hadoop/src/branch/orcid-no-doi/pom.xml#L689

... well no, you moved the version property declaration into the main pom file. You should move the dependency declaration into the `dependencyManagement` section of the main pom file, and then, in the doiboost submodule, express the dependency towards `commons-text` without specifying any version. The rationale here is that ALL the versions of the external libraries we depend on must be expressed in a single place. Otherwise, if everybody starts indicating arbitrary library versions in each submodule, the classpath would become a jungle of libraries, likely with conflicting versions of the same library.
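A sketch of the requested arrangement in the main pom (the 1.8 version comes from the thread above; a version property may be used instead):

```xml
<!-- main pom.xml: the single place where the commons-text version is fixed -->
<dependencyManagement>
	<dependencies>
		<dependency>
			<groupId>org.apache.commons</groupId>
			<artifactId>commons-text</artifactId>
			<version>1.8</version>
		</dependency>
	</dependencies>
</dependencyManagement>
```

The doiboost submodule then declares the same dependency without any `<version>` element, as sketched earlier in the thread.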

claudio.atzori requested changes 2020-11-11 14:26:05 +01:00
claudio.atzori left a comment
Owner

I just noticed that in [SparkGenEnrichedOrcidWorks.java](https://code-repo.d4science.org/D-Net/dnet-hadoop/src/commit/fea2451658a30f36c751741cbbb103fdaee33d5b/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcidnodoi/SparkGenEnrichedOrcidWorks.java) the output format is set to parquet. As the records in this set must be integrated in the graph via the so-called Actions Management system, the data created by this procedure should comply with the input format & model it requires, i.e. a `SequenceFile<org.apache.hadoop.io.Text, org.apache.hadoop.io.Text>` where

  • keys are defined as the entity type's fully qualified class name (e.g. [`eu.dnetlib.dhp.schema.oaf.Publication`](https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/oaf/Publication.java))
  • values are defined as [`eu.dnetlib.dhp.schema.action.AtomicAction`](https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/action/AtomicAction.java)s, a simple wrapper class with just two fields: 1) `Class<T> clazz`; and 2) `T payload`; where `T extends Oaf`.
@ -0,0 +128,4 @@
.createDataset(
oafPublicationRDD.repartition(1).rdd(),
Encoders.bean(Publication.class));
publicationDataset

I just noticed the output format is set to parquet. As the records in this set must be integrated in the graph via the so-called Actions Management system, the data created by this procedure should comply with the input format & model it requires, i.e. a `SequenceFile<org.apache.hadoop.io.Text, org.apache.hadoop.io.Text>` where

  • keys are defined as the entity type's fully qualified class name (e.g. [`eu.dnetlib.dhp.schema.oaf.Publication`](https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/oaf/Publication.java))
  • values are defined as [`eu.dnetlib.dhp.schema.action.AtomicAction`](https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/action/AtomicAction.java)s, a simple wrapper class with just two fields: 1) `Class<T> clazz`; and 2) `T payload`; where `T extends Oaf`.
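A sketch of an output step matching that contract (assumptions: the `oafPublicationRDD` from the earlier snippets, a SparkSession variable `spark`, an `outputPath` parameter, an `AtomicAction(Class, payload)` constructor, and JSON serialization of the wrapper — all illustrative, not the code under review):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

import com.fasterxml.jackson.databind.ObjectMapper;

import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.oaf.Publication;

import scala.Tuple2;

// sketch: wrap each Publication in an AtomicAction and store the pairs as
// SequenceFile<Text, Text>, keyed by the entity class fully qualified name
oafPublicationRDD
	.mapToPair(
		p -> new Tuple2<>(
			new Text(Publication.class.getCanonicalName()),
			new Text(new ObjectMapper()
				.writeValueAsString(new AtomicAction<>(Publication.class, p)))))
	.saveAsNewAPIHadoopFile(
		outputPath,
		Text.class,
		Text.class,
		SequenceFileOutputFormat.class,
		spark.sparkContext().hadoopConfiguration());
```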
claudio.atzori added the enhancement label 2020-11-11 14:30:40 +01:00
claudio.atzori self-assigned this 2020-11-11 14:30:45 +01:00
Author
Contributor

Now the dataset output is in the correct format.
enrico.ottonello closed this pull request 2020-11-13 10:37:00 +01:00
enrico.ottonello reopened this pull request 2020-11-13 10:37:33 +01:00
claudio.atzori closed this pull request 2020-12-02 10:55:12 +01:00
Reference: D-Net/dnet-hadoop#43