orcid-no-doi #43
Reference: D-Net/dnet-hadoop#43
ORCID publications (no DOI) dataset generation from dump files
In this review I checked basic code practices. Please indicate the rationale behind your choices and apply the requested changes.
@ -87,0 +87,4 @@
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.8</version>
Versions of dependencies should only be declared in the main pom file. Please declare this dependency there (v1.8) and refer to it from here without overriding the version.
@ -56,3 +55,4 @@
fs = FileSystem.get(URI.create(hdfsServerUri.concat(workingPath)), conf);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
Let the exception propagate and break the job.
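A minimal sketch of what letting the failure propagate could look like, assuming a hypothetical initFileSystem wrapper around the call shown in the diff (the method name and parameters are illustrative, not the PR's actual code):

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClientSketch {

    // Illustrative only: declare the checked exception instead of swallowing it,
    // so a failure to reach HDFS aborts the job immediately.
    public static FileSystem initFileSystem(String hdfsServerUri, String workingPath, Configuration conf)
            throws IOException {
        return FileSystem.get(URI.create(hdfsServerUri.concat(workingPath)), conf);
    }
}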
@ -0,0 +128,4 @@
}
}
} catch (Exception e) {
Log
What is the reason for neither handling this exception nor letting it propagate? I imagine that a malformed entry in the tar file could cause it, but in that case we should interrupt the procedure and deepen the analysis to spot the error. Handled this way, the error would likely go unnoticed while still causing a drop in the number of output records.
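One way to make such a failure fatal, sketched here with commons-compress and hypothetical names (the actual reading loop in the PR differs), is to wrap and rethrow instead of only logging:

import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class TarReadSketch {

    // Illustrative only: any problem with an entry stops the whole procedure
    // instead of being logged and silently dropping records.
    public static long countEntries(InputStream in) {
        try (TarArchiveInputStream tar = new TarArchiveInputStream(in)) {
            long n = 0;
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (!entry.isDirectory()) {
                    n++;
                }
            }
            return n;
        } catch (IOException e) {
            throw new RuntimeException("Error while reading the tar archive, aborting", e);
        }
    }
}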
@ -0,0 +167,4 @@
return name.getAsString();
}
}
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
@ -0,0 +98,4 @@
"/eu/dnetlib/dhp/doiboost/orcidnodoi/mappings/typologies.json"));
typologiesMapping = new Gson().fromJson(tt, Map.class);
} catch (final Exception e) {
logger.error("loading typologies", e);
This should not happen as this is statically defined, but please let the exception propagate with some subclass of Throwable so that it will break immediately. Otherwise the typologiesMapping variable will stay defined as null, causing the first usage to break with an NPE.
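The suggestion could look like the following fail-fast sketch (class name and loading code are illustrative; only the resource path comes from the diff):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.commons.io.IOUtils;

import com.google.gson.Gson;

public class TypologiesMappingSketch {

    // Illustrative only: fail hard during class initialization if the bundled
    // mapping is missing or malformed, instead of leaving the field null.
    public static final Map TYPOLOGIES_MAPPING;

    static {
        try {
            final String json = IOUtils
                .toString(
                    TypologiesMappingSketch.class
                        .getResourceAsStream("/eu/dnetlib/dhp/doiboost/orcidnodoi/mappings/typologies.json"),
                    StandardCharsets.UTF_8);
            TYPOLOGIES_MAPPING = new Gson().fromJson(json, Map.class);
        } catch (IOException | RuntimeException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
}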
@ -0,0 +117,4 @@
if (errorsGeneric != null) {
errorsGeneric.add(1);
}
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is a filter for the null value:
JavaRDD oafPublicationRDD = enrichedWorksRDD
    .map(
        e -> {
            return (Publication) publicationToOaf
                .generatePublicationActionsFromJson(e._2());
        })
    .filter(p -> p != null);
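As an alternative sketch (not the PR's code), the optionality could be made explicit by returning an Optional from the mapping step and flattening it, so no null ever enters the RDD; toPublication below is a hypothetical stand-in for generatePublicationActionsFromJson:

import java.util.Collections;
import java.util.Optional;

import org.apache.spark.api.java.JavaRDD;

import eu.dnetlib.dhp.schema.oaf.Publication;

public class NullSafeMappingSketch {

    // Illustrative only: records that cannot be converted simply disappear from
    // the result, without relying on a downstream null filter.
    public static JavaRDD<Publication> toPublications(JavaRDD<String> jsonWorks) {
        return jsonWorks.flatMap(json -> toPublication(json)
            .map(Collections::singletonList)
            .orElse(Collections.<Publication> emptyList())
            .iterator());
    }

    // Hypothetical stand-in for the real conversion; empty() signals a skipped record.
    private static Optional<Publication> toPublication(String json) {
        return Optional.empty();
    }
}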
@ -0,0 +124,4 @@
public Oaf generatePublicationActionsFromDump(final JsonObject rootElement) {
if (!isValid(rootElement)) {
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is a filter for the null value:
JavaRDD oafPublicationRDD = enrichedWorksRDD
    .map(e -> (Publication) publicationToOaf.generatePublicationActionsFromJson(e._2()))
    .filter(p -> p != null);
@ -0,0 +175,4 @@
if (errorsInvalidTitle != null) {
errorsInvalidTitle.add(1);
}
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is a check on the null value.
@ -0,0 +245,4 @@
if (errorsInvalidType != null) {
errorsInvalidType.add(1);
}
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is a filter on the null value:
JavaRDD oafPublicationRDD = enrichedWorksRDD
    .map(e -> (Publication) publicationToOaf.generatePublicationActionsFromJson(e._2()))
    .filter(p -> p != null);
@ -0,0 +256,4 @@
if (errorsNotFoundAuthors != null) {
errorsNotFoundAuthors.add(1);
}
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is a filter on the null value:
JavaRDD oafPublicationRDD = enrichedWorksRDD
    .map(e -> (Publication) publicationToOaf.generatePublicationActionsFromJson(e._2()))
    .filter(p -> p != null);
@ -0,0 +328,4 @@
return authors;
}
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is a check on the null value.
@ -0,0 +333,4 @@
private List<String> createRepeatedField(final JsonObject rootElement, final String fieldName) {
if (!rootElement.has(fieldName)) {
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is a check on the null value.
@ -0,0 +336,4 @@
return null;
}
if (rootElement.has(fieldName) && rootElement.get(fieldName).isJsonNull()) {
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is a check on the null value.
@ -0,0 +340,4 @@
}
if (rootElement.get(fieldName).isJsonArray()) {
if (!isValidJsonArray(rootElement, fieldName)) {
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is a check on the null value.
@ -0,0 +387,4 @@
try {
pubDateJson = rootElement.getAsJsonObject(jsonKey);
} catch (Exception e) {
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is this check on the value: StringUtils.isNotBlank.
@ -0,0 +390,4 @@
return null;
}
if (pubDateJson == null) {
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is this check on the value: StringUtils.isNotBlank.
@ -0,0 +397,4 @@
final String day = getStringValue(pubDateJson, "day");
if (StringUtils.isBlank(year)) {
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is this check on the value: StringUtils.isNotBlank.
@ -0,0 +413,4 @@
if (isValidDate(pubDate)) {
return pubDate;
}
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Yes, there is this check on the value: StringUtils.isNotBlank.
@ -0,0 +475,4 @@
private StructuredProperty mapStructuredProperty(String value, Qualifier qualifier, DataInfo dataInfo) {
if (value == null | StringUtils.isBlank(value)) {
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
@ -0,0 +487,4 @@
private Field<String> mapStringField(String value, DataInfo dataInfo) {
if (value == null || StringUtils.isBlank(value)) {
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
The value returned from this function is used to populate two fields of the publication:
publication.setSource(value)
publication.setDateofacceptance(value)
@ -0,0 +20,4 @@
public static String getStringValue(final JsonObject root, final String key) {
if (root.has(key) && !root.get(key).isJsonNull())
return root.get(key).getAsString();
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
Replaced the null value with a safer empty string.
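A minimal sketch of that change, assuming the method otherwise stays as shown in the diff (the enclosing class name is illustrative):

import com.google.gson.JsonObject;

public class JsonFieldSketch {

    // Illustrative only: return an empty string instead of null so callers can use
    // the value directly without a null check.
    public static String getStringValue(final JsonObject root, final String key) {
        if (root.has(key) && !root.get(key).isJsonNull()) {
            return root.get(key).getAsString();
        }
        return "";
    }
}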
@ -0,0 +69,4 @@
final String id = (workNodes.get(0).getAttributes().get("put-code"));
workData.setId(id);
} else {
return null;
Is the caller expecting the null? Otherwise this would likely produce an NPE.
In this case a null value indicates that the original publication in XML format was not parsed because mandatory information was not found; there is a check on this null value.
@ -0,0 +29,4 @@
import eu.dnetlib.doiboost.orcidnodoi.similarity.AuthorMatcher;
import jdk.nashorn.internal.ir.annotations.Ignore;
public class OrcidNoDoiTest {
Why are the tests in this class marked as @Ignore? I see you spent some effort implementing them; I didn't dive much into the details, but I see assertions properly defined, so if they are valuable I suggest keeping the tests active.
Please remove the dependency version entirely from here. The version has to be declared only in the main pom file, just like all the other dependencies in the same pom file.
Please remove the dependency version entirely from dhp-workflows/dhp-doiboost/pom.xml. All dependency versions have to be declared only in the main pom file.
All needed modifications were done for now. I updated the fork with the current master.
I still see the dependency version declared here. Please move it into the project's main pom under the dependencyManagement section.
Please revise the dependency section in the pom file. The version for the dependency org.apache.commons:commons-text must be moved to the project's main pom file.
More specifically, I refer to this dependency: https://code-repo.d4science.org/enrico.ottonello/dnet-hadoop/src/branch/orcid-no-doi/dhp-workflows/dhp-doiboost/pom.xml#L90
I had already moved that dependency to the main pom.xml file, as you can see here: https://code-repo.d4science.org/enrico.ottonello/dnet-hadoop/src/branch/orcid-no-doi/pom.xml#L689
... well no, you moved the version property declaration to the main pom file. You should move the dependency declaration into the dependencyManagement section of the main pom file, and then in the doiboost submodule express the dependency towards commons-text without specifying any version. The rationale here is that ALL the versions for the external libraries we're depending on must be expressed in a single place. Otherwise, if everybody starts to indicate arbitrary library versions in each submodule, the classpath would become a jungle of libraries, likely with conflicting versions of the same library.
I just noticed that in SparkGenEnrichedOrcidWorks.java the output format is being set to parquet. As the records in this set must be integrated in the graph via the so-called Actions Management system, the data created by this procedure should comply with the input format & model it requires, i.e. a SequenceFile<org.apache.hadoop.io.Text, org.apache.hadoop.io.Text> where the records (eu.dnetlib.dhp.schema.oaf.Publication) are wrapped as eu.dnetlib.dhp.schema.action.AtomicActions, a simple wrapper class with just two fields: 1) Class<T> clazz; and 2) T payload; where T extends Oaf.
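A sketch of producing that shape from the enriched publications (class and variable names, the JSON mapper and the output path are assumptions, not the PR's exact code):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.api.java.JavaRDD;

import com.fasterxml.jackson.databind.ObjectMapper;

import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.oaf.Publication;

import scala.Tuple2;

public class SaveAsActionSetSketch {

    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

    // Illustrative only: wrap each Publication into an AtomicAction and store the
    // result as a SequenceFile<Text, Text>, with the payload class name as key and
    // the JSON serialization of the AtomicAction as value.
    public static void save(JavaRDD<Publication> publications, String outputPath) {
        publications
            .map(p -> new AtomicAction<>(Publication.class, p))
            .mapToPair(
                aa -> new Tuple2<>(
                    new Text(aa.getClazz().getCanonicalName()),
                    new Text(OBJECT_MAPPER.writeValueAsString(aa))))
            .saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);
    }
}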
@ -0,0 +128,4 @@
.createDataset(
oafPublicationRDD.repartition(1).rdd(),
Encoders.bean(Publication.class));
publicationDataset
I just noticed the output format being set to parquet. As the records in this set must be integrated in the graph via the so-called Actions Management system, the data created by this procedure should comply with the input format & model it requires, i.e. a SequenceFile<org.apache.hadoop.io.Text, org.apache.hadoop.io.Text> where the records (eu.dnetlib.dhp.schema.oaf.Publication) are wrapped as eu.dnetlib.dhp.schema.action.AtomicActions, a simple wrapper class with just two fields: 1) Class<T> clazz; and 2) T payload; where T extends Oaf.
Now the dataset output is in the correct format.