[graph dedup] records stability and testing #44

Merged
claudio.atzori merged 9 commits from deduptesting into beta 2022-12-02 14:43:06 +01:00
  • implementation of the new procedure to generate the id of the dedup record
  • implementation of the job to create simrels from the entries in the postgres db
* implementation of the new procedure to generate the id of the dedup record * implementation of the job to create simrels from the entries in the postgres db
claudio.atzori reviewed 2020-09-22 12:36:03 +02:00
@ -0,0 +38,4 @@
}
//pick the best pid from the entity. Returns a list (length 1) to save time in the call
public static <T extends OafEntity> List<Identifier> bestPidtoIdentifier(T entity) {

Minor: camel casing should be bestPidToIdentifier

Minor: camel casing should be `bestPidToIdentifier`
claudio.atzori reviewed 2020-09-22 14:15:19 +02:00
@ -0,0 +121,4 @@
}
public boolean isFromDatasourceID(List<KeyValue> collectedFrom, String dsId){

given that this method is used several time by the caller compareTo, probably it is worth extracting the collectedfrom.key from both this and i defining them into HashSet<String>. In this way, lookups for the dsId would be more efficient.

given that this method is used several time by the caller `compareTo`, probably it is worth extracting the `collectedfrom.key` from both `this` and `i` defining them into `HashSet<String>`. In this way, lookups for the `dsId` would be more efficient.
claudio.atzori reviewed 2020-09-22 14:17:56 +02:00
@ -0,0 +65,4 @@
}
@Override
void run(ISLookUpService isLookUpService) {

is the isLookUpService actually used? If not, please remove it

is the `isLookUpService` actually used? If not, please remove it
claudio.atzori requested changes 2020-09-22 14:21:58 +02:00
claudio.atzori left a comment
Owner

The code overall looks good, please address the comment I dropped inline, below.

Moreover, consider to cleanup the hidden files from

dhp-dedup-openaire/src/test/resources/eu/dnetlib/dhp/dedup/assertions/groups

their name starts with a dot.

The code overall looks good, please address the comment I dropped inline, below. Moreover, consider to cleanup the hidden files from ``` dhp-dedup-openaire/src/test/resources/eu/dnetlib/dhp/dedup/assertions/groups ``` their name starts with a dot.
claudio.atzori added the
enhancement
label 2020-10-02 11:58:18 +02:00
claudio.atzori reviewed 2020-10-05 10:39:16 +02:00
@ -0,0 +36,4 @@
}
public void setPid(StructuredProperty pidValue) {
this.pid = pid;

Either rename the parameter pidValue to pid, or assign the local variable pid with pidValue.

Either rename the parameter `pidValue` to `pid`, or assign the local variable `pid` with `pidValue`.
claudio.atzori reviewed 2020-10-05 10:42:46 +02:00
@ -0,0 +34,4 @@
if (bp.get().isUseOriginal() || bp.get().getPid().getValue() == null) {
return bp.get().getOriginalID().split("\\|")[0] + "|dedup_wf_001::"
+ DedupUtility.md5(bp.get().getOriginalID());

Use the common utility eu.dnetlib.dhp.utils.DHPUtils.md5, it has exactly the same implementation. Consider also to remove the method md5(String) from that class.

Use the common utility `eu.dnetlib.dhp.utils.DHPUtils.md5`, it has exactly the same implementation. Consider also to remove the method `md5(String)` from that class.
claudio.atzori reviewed 2020-10-05 11:15:07 +02:00
@ -0,0 +109,4 @@
return -1;
}
if (this.getDate().compareTo(date) == 0) {// same date

this statement is comparing the the date field with itself. I suppose it was intended to be:

this.getDate().compareTo(i.getDate()) == 0

this statement is comparing the the date field with itself. I suppose it was intended to be: ```this.getDate().compareTo(i.getDate()) == 0```
claudio.atzori reviewed 2020-10-05 11:30:37 +02:00
@ -0,0 +120,4 @@
return -this.originalID.compareTo(i.originalID);
} else
// the minus is because we need to take the elder date
return -this.getDate().compareTo(date);

same issue here, the local date is being compared with itself instead with the other from the variable i.

same issue here, the local date is being compared with itself instead with the other from the variable _i_.
claudio.atzori reviewed 2020-10-05 14:59:33 +02:00
@ -0,0 +4,4 @@
public enum PidType {
// from the less to the more important
undefined, original, orcid, ror, grid, pdb, arXiv, pmid, doi;

I don't see pmc identifiers here. Did we decide to not consider them?

FYI: here's the full list of the ID types http://api.openaire.eu/vocabularies/dnet:pid_types

I don't see _pmc_ identifiers here. Did we decide to not consider them? FYI: here's the full list of the ID types http://api.openaire.eu/vocabularies/dnet:pid_types
claudio.atzori requested changes 2020-10-05 15:00:08 +02:00
claudio.atzori left a comment
Owner

@michele.debonis please have a look at the further comments I added today

@michele.debonis please have a look at the further comments I added today
claudio.atzori reviewed 2020-10-09 11:56:20 +02:00
@ -0,0 +48,4 @@
public static <T extends OafEntity> ArrayList<Identifier> createBasePid(T entity, SimpleDateFormat sdf) {
Date date;

This code block can be defined statically outside from this method as it doesn't depend on any variable but only relies on the constant BASE_DATE.

This code block can be defined statically outside from this method as it doesn't depend on any variable but only relies on the constant `BASE_DATE`.
claudio.atzori reviewed 2020-10-09 11:57:35 +02:00
@ -0,0 +63,4 @@
// pick the best pid from the entity. Returns a list (length 1) to save time in the call
public static <T extends OafEntity> List<Identifier> bestPidToIdentifier(T entity) {
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");

Probably sdf can be defined statically too, outside from this method.

Probably `sdf` can be defined statically too, outside from this method.
claudio.atzori reviewed 2020-10-09 11:57:36 +02:00

Probably sdf can be defined statically too, outside from this method.

Probably `sdf` can be defined statically too, outside from this method.
claudio.atzori added 2223 commits 2022-11-07 12:23:17 +01:00
faa977df7e Merge pull request 'orcid-no-doi' (#43) from enrico.ottonello/dnet-hadoop:orcid-no-doi into master
The dataset was generated and is now part of the actionsets available in BETA
0276180039 WIP mdstore
transaction implemented on hadoop side
b3f5c2351d Merge branch 'hadoop_aggregator' of code-repo.d4science.org:D-Net/dnet-hadoop into hadoop_aggregator
 Conflicts:
	dhp-workflows/dhp-aggregation/src/test/java/eu/dnetlib/dhp/transformation/TransformationJobTest.java
e8789b0cdb Merge pull request 'stats DB for monitor' (#99) from antonis.lempesis/dnet-hadoop:master into master
Looks good to me, just a note on the parsing of the citations: since the last version, IIS produces citations as proper relationships among results. This is what we got already in the BETA graph

```
count		r.reltype	r.subreltype	r.relclass
62.129.254	resultResult	citation	cites
62.043.309	resultResult	citation	isCitedBy
```

Thus, I suggest to move away from the current property based implementation for the extraction of the citation links and start relying on the relationships instead.
37b65cc3ad Merge pull request 'updates on stats-update workflow' (#100) from antonis.lempesis/dnet-hadoop:master into master
The workflow integrated in the _stable_ids_ branch has been run correctly on the BETA content, thus IMO this PR can be integrated in the master branch.

Reviewed-on: #100
073dcea2aa Added passing of the following parameters:
-varDataSourceId
-varOfficialName

in Each transformation Rule
6424cd9062 Added passing of the following parameters:
-varDataSourceId
-varOfficialName

in Each transformation Rule
2e70aa43f0 Merge pull request 'H2020Classification fix and possibility to add datasources in blacklist for propagation of result to organization' (#108) from miriam.baglioni/dnet-hadoop:master into master
Reviewed-on: #108

The changes look ok, but please drop a comment to describe how the parameters should be changed from the workflow caller for both workflows
* H2020Classification
* propagation of result to organization
bc12e9819e Aggiornare 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala'
The change is to fix the issue that arises when the same work appears more than once on the same ORCID profile. The change avoid to replicate the association doi -> author when the orcid id is already associated to the doi.
f33521d338 Aggiornare 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala'
to be able to replace the aboject assigned to author val has been replaced by var
32b0c27217 Aggiornare 'dhp-workflows/dhp-enrichment/src/main/java/eu/dnetlib/dhp/resulttoorganizationfrominstrepo/PrepareResultInstRepoAssociation.java'
fix in SQL query: while writing the blacklist constraint it used d.id to indicate the datasource id, but no alias for the datasource was defined. So I removed the alias
dd997c49e0 fix wrong relation id
fix date thai ticket #6791
c6fa8598e1 massive code refactor:
removed modules dhp-*-scholexplorer
4cb65bc64a fixed process doiboost workflow:
- splitted OrcidToOAF into two phase preprocess and process
- updated workflow used in production
c35c117601 fixed process doiboost workflow:
- splitted OrcidToOAF into two phase preprocess and process
- updated workflow used in production
3d8e2aa146 Code refactor:
- removed old workflows in doiboost
 - splitted workflow of doiboost in preprocess and process
10068c00ea Code refactor:
- removed old workflows in doiboost
 - splitted workflow of doiboost in preprocess and process
511da98d0c - fixed bug on download pmc Article
- removed unused line of code in SparkCreateActionset
ae4e99a471 Adapted workflow of resolution of PID to work into OpenAIRE data workflow
- Added relations in both verse on all Scholexplorer datasources
4acfa8fa2e Scholexplorer Datasource Aggregation:
- Added collectedfrom in the inverse relation generated
Relation resolution:
- increased number of partitions in workflow.xml
- using classid instead of classname to build the pid-dnetId mapping
793b5a8e5f Aggiornare 'dhp-workflows/dhp-graph-mapper/src/main/java/eu/dnetlib/dhp/oa/graph/dump/ResultMapper.java'
Removing the dump of Measure at the level of the result. We decided not to map it
35e20b0647 updated resolution wf:
- generate a new version of the graph
 - changed merge from union to join
09fc2afdca Added indi_funder_country_collab
Kept only indi_pub_has_cc_licence
7af0bbd0b1 [scala-refactor] Module dhp-aggregation:
Moved all scala source into src/main/scala and src/test/scala
81bf604059 [scala-refactor] Module dhp-common:
Moved all scala source into src/main/scala and src/test/scala
bf880e2508 [scala-refactor] Module dhp-graph-mapper:
Moved all scala source into src/main/scala and src/test/scala
e9f285ec4d [scala-refactor] Module dhp-doiboost:
Moved all scala source into src/main/scala and src/test/scala
e5bff64f2e [scholexplorer]
- Minor fix on SparkConvertRDDtoDataset
-first implementation of retrieve datacite dump
63952018c0 [scholexplorer]
-moved SparkRetrieveDataciteDelta in scala folder
b881ee5ef8 [scholexplorer]
- implemented generation of scholix of delta update of datacite
0163dadb7f [doiboost]
- update MAG schema, new filed added on version dec-2021
d580e15442 Modified last intersection since we lost many titles.
this is my last resource, after that, I've to  change my job
81242538e6 Merge pull request 'Oozie workflow for cleancontext' (#216) from cleancontext into beta
Reviewed-on: #216

Looks good. We need to extend the cleaning workflow parameters to enable the extra step only when it is needed.
9886fe87ec - Added FOS classification
- Added extra orgs in monitor
- Fixed result-project and organization-project tables
dcd85f8cd7 - Synchronize indicators in stats-db with monitor-db
- added new openorg id for Nanyang Technological University
- changed openorg id for University of Helsinki #8088 ticket
claudio.atzori added 2 commits 2022-11-11 16:10:31 +01:00
claudio.atzori added 2 commits 2022-11-17 16:13:50 +01:00
claudio.atzori changed target branch from master to beta 2022-11-22 10:21:11 +01:00
claudio.atzori added 1 commit 2022-11-22 10:21:32 +01:00
claudio.atzori changed title from deduptesting to [graph dedup] records stability and testing 2022-11-30 11:21:06 +01:00
claudio.atzori added 1 commit 2022-11-30 11:27:31 +01:00

I revamp this PR to improve the stability of the properties used to build the representative records. So far they were set according to the result datainfo.trust and delegating the merging to the Oaf::mergeFrom method. Few exceptions were defined for manageing the definition of the set of authors and the dateofacceptance, whose value doesn't depend on the trust.

In fact, as the vast majority of the records share the same trust (0.9), in practice the selection and the prevalence of one value over another was not really predictable, especially for the non repeatible fields whose values, once set, would onyl be updated from records with higher trust.

So, this PR introduces a change in the field merging procedure, making it use the same ordering criteria used to select the record whose identifier is the basis for the construction of the root's record ID.

This ordering criteria is implemented in

eu.dnetlib.dhp.oa.dedup.IdentifierComparator

and now is used in the eu.dnetlib.dhp.oa.dedup.DedupRecordFactory to sort the duplicates before starting to build the root record properties.

Moreover, this PR introduces a more extensive testing case to verify that the root records preserve their property values also when the elements in the duplicates groups changes. This test is implemented in

eu.dnetlib.dhp.oa.dedup.SparkPublicationRootsTest

I revamp this PR to improve the stability of the properties used to build the representative records. So far they were set according to the result `datainfo.trust` and delegating the merging to the `Oaf::mergeFrom` method. Few exceptions were defined for manageing the definition of the set of `authors` and the `dateofacceptance`, whose value doesn't depend on the trust. In fact, as the vast majority of the records share the same trust (`0.9`), in practice the selection and the prevalence of one value over another was not really predictable, especially for the non repeatible fields whose values, once set, would onyl be updated from records with higher trust. So, this PR introduces a change in the field merging procedure, making it use the same ordering criteria used to select the record whose identifier is the basis for the construction of the root's record ID. This ordering criteria is implemented in ```eu.dnetlib.dhp.oa.dedup.IdentifierComparator``` and now is used in the ```eu.dnetlib.dhp.oa.dedup.DedupRecordFactory``` to sort the duplicates before starting to build the root record properties. Moreover, this PR introduces a more extensive testing case to verify that the root records preserve their property values also when the elements in the duplicates groups changes. This test is implemented in ```eu.dnetlib.dhp.oa.dedup.SparkPublicationRootsTest```
claudio.atzori merged commit 41f7f1bbc5 into beta 2022-12-02 14:43:06 +01:00
Sign in to join this conversation.
No reviewers
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: D-Net/dnet-hadoop#44
No description provided.