[graph dedup] records stability and testing #44
No reviewers
Labels
No Label
bug
duplicate
enhancement
help wanted
invalid
question
RDGraph
RSAC
wontfix
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: D-Net/dnet-hadoop#44
Loading…
Reference in New Issue
No description provided.
Delete Branch "deduptesting"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
@ -0,0 +38,4 @@
}
//pick the best pid from the entity. Returns a list (length 1) to save time in the call
public static <T extends OafEntity> List<Identifier> bestPidtoIdentifier(T entity) {
Minor: camel casing should be
bestPidToIdentifier
@ -0,0 +121,4 @@
}
public boolean isFromDatasourceID(List<KeyValue> collectedFrom, String dsId){
given that this method is used several time by the caller
compareTo
, probably it is worth extracting thecollectedfrom.key
from boththis
andi
defining them intoHashSet<String>
. In this way, lookups for thedsId
would be more efficient.@ -0,0 +65,4 @@
}
@Override
void run(ISLookUpService isLookUpService) {
is the
isLookUpService
actually used? If not, please remove itThe code overall looks good, please address the comment I dropped inline, below.
Moreover, consider to cleanup the hidden files from
their name starts with a dot.
@ -0,0 +36,4 @@
}
public void setPid(StructuredProperty pidValue) {
this.pid = pid;
Either rename the parameter
pidValue
topid
, or assign the local variablepid
withpidValue
.@ -0,0 +34,4 @@
if (bp.get().isUseOriginal() || bp.get().getPid().getValue() == null) {
return bp.get().getOriginalID().split("\\|")[0] + "|dedup_wf_001::"
+ DedupUtility.md5(bp.get().getOriginalID());
Use the common utility
eu.dnetlib.dhp.utils.DHPUtils.md5
, it has exactly the same implementation. Consider also to remove the methodmd5(String)
from that class.@ -0,0 +109,4 @@
return -1;
}
if (this.getDate().compareTo(date) == 0) {// same date
this statement is comparing the the date field with itself. I suppose it was intended to be:
this.getDate().compareTo(i.getDate()) == 0
@ -0,0 +120,4 @@
return -this.originalID.compareTo(i.originalID);
} else
// the minus is because we need to take the elder date
return -this.getDate().compareTo(date);
same issue here, the local date is being compared with itself instead with the other from the variable i.
@ -0,0 +4,4 @@
public enum PidType {
// from the less to the more important
undefined, original, orcid, ror, grid, pdb, arXiv, pmid, doi;
I don't see pmc identifiers here. Did we decide to not consider them?
FYI: here's the full list of the ID types http://api.openaire.eu/vocabularies/dnet:pid_types
@michele.debonis please have a look at the further comments I added today
@ -0,0 +48,4 @@
public static <T extends OafEntity> ArrayList<Identifier> createBasePid(T entity, SimpleDateFormat sdf) {
Date date;
This code block can be defined statically outside from this method as it doesn't depend on any variable but only relies on the constant
BASE_DATE
.@ -0,0 +63,4 @@
// pick the best pid from the entity. Returns a list (length 1) to save time in the call
public static <T extends OafEntity> List<Identifier> bestPidToIdentifier(T entity) {
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
Probably
sdf
can be defined statically too, outside from this method.Probably
sdf
can be defined statically too, outside from this method.83c90c7180
4094f2bb9a
8de9788308
9f3036c847
a6977197b3
deduptestingto [graph dedup] records stability and testingI revamp this PR to improve the stability of the properties used to build the representative records. So far they were set according to the result
datainfo.trust
and delegating the merging to theOaf::mergeFrom
method. Few exceptions were defined for manageing the definition of the set ofauthors
and thedateofacceptance
, whose value doesn't depend on the trust.In fact, as the vast majority of the records share the same trust (
0.9
), in practice the selection and the prevalence of one value over another was not really predictable, especially for the non repeatible fields whose values, once set, would onyl be updated from records with higher trust.So, this PR introduces a change in the field merging procedure, making it use the same ordering criteria used to select the record whose identifier is the basis for the construction of the root's record ID.
This ordering criteria is implemented in
eu.dnetlib.dhp.oa.dedup.IdentifierComparator
and now is used in the
eu.dnetlib.dhp.oa.dedup.DedupRecordFactory
to sort the duplicates before starting to build the root record properties.Moreover, this PR introduces a more extensive testing case to verify that the root records preserve their property values also when the elements in the duplicates groups changes. This test is implemented in
eu.dnetlib.dhp.oa.dedup.SparkPublicationRootsTest