orcid-no-doi #43

Merged
claudio.atzori merged 45 commits from enrico.ottonello/dnet-hadoop:orcid-no-doi into master 3 years ago
Collaborator

orcid publications no doi dataset generation from dump files
claudio.atzori requested changes 4 years ago
claudio.atzori left a comment
Owner

In this review I checked basic code practices like

  • exception propagation
  • management of side cases (methods returning null)

Please indicate the rationale behind your choices and apply the requested changes.
@ -87,0 +87,4 @@
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.8</version>
Owner

Versions of dependencies should be only declared in the main pom file. Please declare this dependency there (v1.8) and refer to it without overriding the version.
@ -56,3 +55,4 @@
fs = FileSystem.get(URI.create(hdfsServerUri.concat(workingPath)), conf);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
Owner

Let the exception propagate and break the job
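A minimal sketch of the requested change (the wrapping exception type is an assumption; any unhandled failure reaching the driver would break the job):

```java
try {
    fs = FileSystem.get(URI.create(hdfsServerUri.concat(workingPath)), conf);
} catch (IOException e) {
    // Sketch: rethrow instead of printStackTrace(), so the job fails fast
    // and a missing HDFS connection cannot go unnoticed.
    throw new RuntimeException("Unable to get FileSystem for " + hdfsServerUri, e);
}
```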
@ -0,0 +128,4 @@
}
}
} catch (Exception e) {
Log
Owner

What is the reason for not handling this exception nor letting it propagate? I imagine that a malformed entry in the tar file could cause it, but in that case we should interrupt the procedure and deepen the analysis to spot the error. As it is, the error would likely go unnoticed, causing a drop in the number of output records.
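A hedged sketch of the suggested handling, assuming the catch block sits inside the tar-reading loop and that `entry` names the current `TarArchiveEntry` (both assumptions):

```java
try {
    // ... parse the current tar entry into a record ...
} catch (Exception e) {
    // Sketch: interrupt the procedure instead of swallowing the error,
    // so a malformed entry gets analysed rather than silently dropped.
    throw new RuntimeException("Failed to process tar entry " + entry.getName(), e);
}
```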
@ -0,0 +167,4 @@
return name.getAsString();
}
}
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
@ -0,0 +98,4 @@
"/eu/dnetlib/dhp/doiboost/orcidnodoi/mappings/typologies.json"));
typologiesMapping = new Gson().fromJson(tt, Map.class);
} catch (final Exception e) {
logger.error("loading typologies", e);
Owner

This should not happen as this is statically defined, but please let the exception propagate with some subclass of `Throwable` so that it will break immediately. Otherwise the `typologiesMapping` variable will stay defined as `null` causing the 1st usage to break with a NPE.
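A minimal sketch of the suggested fix, assuming the field can be assigned in a static initializer (the enclosing class name `PublicationToOaf` and the field type are assumptions); throwing `ExceptionInInitializerError` makes the class fail to load immediately instead of leaving the field `null`:

```java
private static Map<String, Map<String, String>> typologiesMapping;

static {
    try {
        final String tt = IOUtils
            .toString(
                PublicationToOaf.class
                    .getResourceAsStream("/eu/dnetlib/dhp/doiboost/orcidnodoi/mappings/typologies.json"));
        typologiesMapping = new Gson().fromJson(tt, Map.class);
    } catch (final Exception e) {
        // Fail class loading immediately; a missing or broken mapping file is fatal.
        throw new ExceptionInInitializerError(e);
    }
}
```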
@ -0,0 +117,4 @@
if (errorsGeneric != null) {
errorsGeneric.add(1);
}
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is a filter for null values:

JavaRDD<Publication> oafPublicationRDD = enrichedWorksRDD
    .map(
        e -> {
            return (Publication) publicationToOaf
                .generatePublicationActionsFromJson(e._2());
        })
    .filter(p -> p != null);
@ -0,0 +124,4 @@
public Oaf generatePublicationActionsFromDump(final JsonObject rootElement) {
if (!isValid(rootElement)) {
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is a filter for null values:

JavaRDD<Publication> oafPublicationRDD = enrichedWorksRDD
    .map(
        e -> {
            return (Publication) publicationToOaf
                .generatePublicationActionsFromJson(e._2());
        })
    .filter(p -> p != null);
@ -0,0 +175,4 @@
if (errorsInvalidTitle != null) {
errorsInvalidTitle.add(1);
}
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is a check on the null value
@ -0,0 +245,4 @@
if (errorsInvalidType != null) {
errorsInvalidType.add(1);
}
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is a filter on null values:

JavaRDD<Publication> oafPublicationRDD = enrichedWorksRDD
    .map(
        e -> {
            return (Publication) publicationToOaf
                .generatePublicationActionsFromJson(e._2());
        })
    .filter(p -> p != null);
@ -0,0 +256,4 @@
if (errorsNotFoundAuthors != null) {
errorsNotFoundAuthors.add(1);
}
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is a filter on null values:

JavaRDD<Publication> oafPublicationRDD = enrichedWorksRDD
    .map(
        e -> {
            return (Publication) publicationToOaf
                .generatePublicationActionsFromJson(e._2());
        })
    .filter(p -> p != null);
@ -0,0 +328,4 @@
return authors;
}
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is a check on the null value
@ -0,0 +333,4 @@
private List<String> createRepeatedField(final JsonObject rootElement, final String fieldName) {
if (!rootElement.has(fieldName)) {
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is a check on the null value
@ -0,0 +336,4 @@
return null;
}
if (rootElement.has(fieldName) && rootElement.get(fieldName).isJsonNull()) {
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is a check on the null value
@ -0,0 +340,4 @@
}
if (rootElement.get(fieldName).isJsonArray()) {
if (!isValidJsonArray(rootElement, fieldName)) {
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is a check on the null value
@ -0,0 +387,4 @@
try {
pubDateJson = rootElement.getAsJsonObject(jsonKey);
} catch (Exception e) {
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is this check on the value: StringUtils.isNotBlank
@ -0,0 +390,4 @@
return null;
}
if (pubDateJson == null) {
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is this check on the value: StringUtils.isNotBlank
@ -0,0 +397,4 @@
final String day = getStringValue(pubDateJson, "day");
if (StringUtils.isBlank(year)) {
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is this check on the value: StringUtils.isNotBlank
@ -0,0 +413,4 @@
if (isValidDate(pubDate)) {
return pubDate;
}
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

yes, there is this check on the value: StringUtils.isNotBlank
@ -0,0 +475,4 @@
private StructuredProperty mapStructuredProperty(String value, Qualifier qualifier, DataInfo dataInfo) {
if (value == null | StringUtils.isBlank(value)) {
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
@ -0,0 +487,4 @@
private Field<String> mapStringField(String value, DataInfo dataInfo) {
if (value == null || StringUtils.isBlank(value)) {
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

the value returned from this function is used to populate 2 fields of publication:
publication.setSource(value)
publication.setDateofacceptance(value)
@ -0,0 +20,4 @@
public static String getStringValue(final JsonObject root, final String key) {
if (root.has(key) && !root.get(key).isJsonNull())
return root.get(key).getAsString();
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

replaced the null value with a safer empty string
@ -0,0 +69,4 @@
final String id = (workNodes.get(0).getAttributes().get("put-code"));
workData.setId(id);
} else {
return null;
Owner

Is the caller expecting the `null`? Otherwise this would likely produce a NPE.
Poster
Collaborator

in this case a null value indicates that the original publication in XML format was not parsed, because mandatory information was not found; there is a check on this null value
@ -0,0 +29,4 @@
import eu.dnetlib.doiboost.orcidnodoi.similarity.AuthorMatcher;
import jdk.nashorn.internal.ir.annotations.Ignore;
public class OrcidNoDoiTest {
Owner

Why are the tests in this class marked as `@Ignore`? I see you spent some effort implementing them; I didn't dive much into the details, but I see assertions properly defined, so if they are valuable I suggest keeping the tests active.
claudio.atzori reviewed 4 years ago
Owner

Please remove the dependency version from here entirely. The version has to be declared *only* in the main pom file, just like all the other dependencies in the same pom file.
claudio.atzori requested changes 4 years ago
claudio.atzori left a comment
Owner

Please remove the dependency version entirely from `dhp-workflows/dhp-doiboost/pom.xml`. All dependency versions have to be declared only in the main pom file.
Poster
Collaborator

All the needed modifications are done for now. I updated the fork with the current master.
claudio.atzori reviewed 4 years ago
Owner

I still see the dependency version declared here. Please move it to the project's main pom under the _dependencyManagement_ section.
claudio.atzori requested changes 4 years ago
claudio.atzori left a comment
Owner

Please revise the dependency section in the pom file. The version for the dependency `org.apache.commons:commons-text` must be moved to the project's main pom file.

More specifically, I refer to this dependency: https://code-repo.d4science.org/enrico.ottonello/dnet-hadoop/src/branch/orcid-no-doi/dhp-workflows/dhp-doiboost/pom.xml#L90
Poster
Collaborator

I had already moved that dependency in the main pom.xml file, as you can see here https://code-repo.d4science.org/enrico.ottonello/dnet-hadoop/src/branch/orcid-no-doi/pom.xml#L689
Owner

> I had already moved that dependency in the main pom.xml file, as you can see here https://code-repo.d4science.org/enrico.ottonello/dnet-hadoop/src/branch/orcid-no-doi/pom.xml#L689

... well no, you moved the version property declaration into the main pom file. You should move the dependency declaration into the `dependencyManagement` section of the main pom file, and then, in the doiboost submodule, express the dependency towards `commons-text` without specifying any version. The rationale here is that ALL the versions for the external libraries we depend on must be expressed in a single place. Otherwise, if everybody starts indicating arbitrary library versions in each submodule, the classpath becomes a jungle of libraries, likely with conflicting versions of the same library.
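A sketch of the requested layout, with version `1.8` as indicated earlier (element placement is illustrative):

```xml
<!-- main pom.xml: the single place where the commons-text version is pinned -->
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-text</artifactId>
            <version>1.8</version>
        </dependency>
    </dependencies>
</dependencyManagement>

<!-- dhp-workflows/dhp-doiboost/pom.xml: no version, it is inherited from above -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
</dependency>
```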
Owner

I just noticed that in [SparkGenEnrichedOrcidWorks.java](https://code-repo.d4science.org/D-Net/dnet-hadoop/src/commit/fea2451658a30f36c751741cbbb103fdaee33d5b/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcidnodoi/SparkGenEnrichedOrcidWorks.java) the output format is set to parquet. As the records in this set must be integrated in the graph via the so-called Actions Management system, the data created by this procedure should comply with the input format & model it requires, i.e. a `SequenceFile<org.apache.hadoop.io.Text, org.apache.hadoop.io.Text>` where

  • keys are defined as the entity type class fully qualified name (e.g. [`eu.dnetlib.dhp.schema.oaf.Publication`](https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/oaf/Publication.java))
  • values are defined as [`eu.dnetlib.dhp.schema.action.AtomicAction`](https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/action/AtomicAction.java)s, a simple wrapper class with just two fields: 1) `Class<T> clazz`; and 2) `T payload`; where `T extends Oaf`.
claudio.atzori requested changes 4 years ago
@ -0,0 +128,4 @@
.createDataset(
oafPublicationRDD.repartition(1).rdd(),
Encoders.bean(Publication.class));
publicationDataset
Owner

I just noticed the output format is set to parquet. As the records in this set must be integrated in the graph via the so-called Actions Management system, the data created by this procedure should comply with the input format & model it requires, i.e. a `SequenceFile<org.apache.hadoop.io.Text, org.apache.hadoop.io.Text>` as described in the comment above.
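A hedged sketch of what producing the expected format could look like, reusing the `oafPublicationRDD` from the earlier snippet; `outputPath`, the JavaSparkContext `sc`, and the Jackson `ObjectMapper` serialization are assumptions, not the author's actual code:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.oaf.Publication;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import scala.Tuple2;

// Keys carry the fully qualified entity class name; values carry the
// JSON-serialized AtomicAction wrapping each Publication.
oafPublicationRDD
    .map(p -> new AtomicAction<>(Publication.class, p))
    .mapToPair(aa -> new Tuple2<>(
        new Text(aa.getClazz().getCanonicalName()),
        new Text(new ObjectMapper().writeValueAsString(aa))))
    .saveAsNewAPIHadoopFile(
        outputPath,
        Text.class,
        Text.class,
        SequenceFileOutputFormat.class,
        sc.hadoopConfiguration());
```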
claudio.atzori added the enhancement label 4 years ago
claudio.atzori self-assigned this 4 years ago
Poster
Collaborator

Now the dataset output is in the correct format.
enrico.ottonello closed this pull request 4 years ago
enrico.ottonello reopened this pull request 4 years ago
claudio.atzori closed this pull request 3 years ago
The pull request has been merged as faa977df7e.
You can also view command line instructions.

Step 1:

From your project repository, check out a new branch and test the changes.
git checkout -b enrico.ottonello-orcid-no-doi master
git pull orcid-no-doi

Step 2:

Merge the changes and update on Gitea.
git checkout master
git merge --no-ff enrico.ottonello-orcid-no-doi
git push origin master
Reference: D-Net/dnet-hadoop#43