[WebCrawl] adding affiliation relations from web information #428

miriam.baglioni · 2024-04-22T11:06:44+02:00

miriam.baglioni commented

2024-04-22 11:06:44 +02:00

This PR adds affiliation information from web resources. It takes everything for IE and up to 2021 for the other countries.

claudio.atzori was assigned by miriam.baglioni

2024-04-22 11:06:44 +02:00

miriam.baglioni added 1 commit 2024-04-22 11:06:45 +02:00

776c898c4b [WebCrawl] adding affiliation relations from web information

claudio.atzori added 1 commit 2024-04-22 11:40:25 +02:00

eb4692e4ee Merge branch 'beta' into WebCrowlBeta

claudio.atzori requested changes 2024-04-22 12:22:16 +02:00

claudio.atzori left a comment

Please address the inline comments

dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/webcrawl/CreateActionSetFromWebEntries.java Outdated

						
				@ -0,0 +78,4 @@

				                spark -> {

				                    createActionSet(spark, inputPath, outputPath + "actionSet");

				                    createPlainRelations(spark, inputPath, outputPath + "relations");

claudio.atzori commented

2024-04-22 12:22:02 +02:00

The output path is controlled by the actionset management framework and should not be altered by the workflow. Hence, consider placing any additional output elsewhere, under a different root directory. Also, the output of the `createActionSet` function should be placed directly inside `outputPath`.
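A minimal sketch of the layout this comment asks for, under stated assumptions: the `OutputLayout` class, the `plan` helper, and the `workingDir` parameter are hypothetical names introduced only for illustration and are not part of the PR. It shows the action set being written directly to `outputPath`, while the plain relations go under a separate root.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: not part of the PR, only illustrates the requested layout.
class OutputLayout {

    // Maps each output to the location where it would be written.
    // `workingDir` is an assumed, workflow-owned directory distinct from outputPath.
    static Map<String, String> plan(String outputPath, String workingDir) {
        Map<String, String> paths = new LinkedHashMap<>();
        // The action set goes directly inside outputPath, with no suffix appended.
        paths.put("actionSet", outputPath);
        // The plain relations live under a different root directory.
        paths.put("relations", workingDir + "/relations");
        return paths;
    }

    public static void main(String[] args) {
        Map<String, String> paths = plan("/tmp/actionset/webcrawl", "/tmp/working/webcrawl");
        paths.forEach((name, path) -> System.out.println(name + " -> " + path));
    }
}
```

With this split, the workflow never derives new locations from `outputPath`, which stays under the control of the actionset management framework.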

dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/webcrawl/CreateActionSetFromWebEntries.java Outdated

						
				@ -0,0 +197,4 @@

				        return createAffiliatioRelationPair(

				                PMCID_PREFIX

				                        + IdentifierFactory

				                        .md5(PidCleaner.normalizePidValue(PidType.pmc.toString(), "PMC" + pmcid.substring(43))),

claudio.atzori commented

2024-04-22 12:18:08 +02:00

It seems that the PIDs found in the input dataset include their resolver. Since we need them without it, it would be better to make this manipulation explicit.
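One way to make the manipulation explicit is to strip a named resolver prefix instead of calling `substring` with a hard-coded offset. The sketch below is only an illustration: the `PidResolverStripper` class, the `stripResolver` method, and the `PMC_RESOLVER` value are assumptions, since the actual resolver URL used in the input dataset is not shown in this diff.

```java
// Hypothetical helper: names and the resolver URL are assumptions, not taken from the PR.
class PidResolverStripper {

    // Assumed resolver prefix; replace with the one actually present in the input data.
    static final String PMC_RESOLVER = "https://www.ncbi.nlm.nih.gov/pmc/articles/";

    // Removes the resolver prefix from a PID, failing loudly on unexpected input
    // rather than silently cutting the string at a fixed position.
    static String stripResolver(String pid, String resolverPrefix) {
        if (pid == null || !pid.startsWith(resolverPrefix)) {
            throw new IllegalArgumentException("Unexpected PID format: " + pid);
        }
        return pid.substring(resolverPrefix.length());
    }

    public static void main(String[] args) {
        String raw = PMC_RESOLVER + "PMC1234567";
        System.out.println(stripResolver(raw, PMC_RESOLVER));
    }
}
```

Compared with a numeric offset, the named constant documents which prefix is expected, and the `startsWith` check surfaces records that do not follow that format.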