DUMP #212

miriam.baglioni · 2022-04-13T18:16:56+02:00

miriam.baglioni commented

2022-04-13 18:16:56 +02:00

This PR changes how the products associated with a funder are retrieved, and adds the code to generate the delta of the projects since the last graph creation in production.

miriam.baglioni added the

enhancement

label 2022-04-13 18:16:56 +02:00

alessia.bardi was assigned by miriam.baglioni

2022-04-13 18:16:56 +02:00

claudio.atzori was assigned by miriam.baglioni

2022-04-13 18:16:56 +02:00

miriam.baglioni added 10 commits 2022-04-13 18:16:57 +02:00

faf79db4d5 [Dump Funders] -

13d1d73b2e [Dump Funders] -

9ba598a9b5 [Dump Funders] -

5331dea71b [Dump Funders] new code for the dump od products related to funders

f738acb85a [Dump Funders] new code for the dump of products related to funders

c0dab69349 [Graph DUMP] removed not used sub workflow

b0f0ae180c merge branch with master

1e251d34a1 [Graph DUMP] add code to produce the delta of new projects with respect to the previous delta/dump

1a0615125f [Graph DUMP] fixed issue in workflow

38b8d324af merge branch with master

alessia.bardi reviewed 2022-04-13 18:40:19 +02:00

alessia.bardi left a comment

Miriam, let's talk again about the project diff: something is still unclear to me, as you can read in one of my comments.

dhp-workflows/dhp-graph-mapper/src/main/java/eu/dnetlib/dhp/oa/graph/dump/funderresults/SparkDumpFunderResults.java

						
				@ -21,3 +25,4 @@

				import eu.dnetlib.dhp.schema.dump.oaf.community.Funder;

				import eu.dnetlib.dhp.schema.dump.oaf.community.Project;

				/**

alessia.bardi commented

2022-04-13 18:32:29 +02:00

Why are there changes in this class? Isn't it the new class SparkDumpFunderresults2 that implements the new approach?

dhp-workflows/dhp-graph-mapper/src/main/java/eu/dnetlib/dhp/oa/graph/dump/projectssubset/ProjectsSubsetSparkJob.java

						
				@ -0,0 +66,4 @@

							});

					}

					private static void getNewProjectList(SparkSession spark, String inputPath, String outputPath,

alessia.bardi commented

2022-04-13 18:35:25 +02:00

Make it more explicit which is the old set of projects and which is the new one.

miriam.baglioni commented

2022-04-20 17:40:01 +02:00

In which way is it not clear?

dhp-workflows/dhp-graph-mapper/src/main/java/eu/dnetlib/dhp/oa/graph/dump/projectssubset/ProjectsSubsetSparkJob.java

						
				@ -0,0 +75,4 @@

						Dataset<Project> projects;

						projects = Utils.readPath(spark, inputPath, Project.class);

						projects

alessia.bardi commented

2022-04-13 18:39:25 +02:00

Does this mean that the output is always the diff with respect to the latest published dump?
When we talked, I understood that you plan to deposit incremental diffs every time, but I do not see how you create the proper input for that.


miriam.baglioni commented

2022-04-13 20:37:19 +02:00

ProjectListPath contains the list of projects we have already sent to Zenodo, while projects contains the newly dumped set.
We do a left join between the two datasets: when a new project id has no match in the list, that project is one to be returned.
The set of new projects to be returned is then read again and each of its ids is appended to the projectList, so each time we consider the whole set of projects already given to Zenodo and we provide only the new ones.
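
For readers following the thread, the delta logic described above can be sketched with plain Java collections. This is only an illustration under stated assumptions, not the Spark code in ProjectsSubsetSparkJob: the class name ProjectDeltaSketch, the method newProjects, and the sample ids are hypothetical, and an in-memory set stands in for the list stored at ProjectListPath.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: in-memory stand-in for the "already sent" list and the new dump.
public class ProjectDeltaSketch {

    /**
     * Returns the ids of the new dump that are not in the already-sent list,
     * in dump order. Ids repeated inside the dump are returned only once.
     * The alreadySent set itself is not modified.
     */
    static List<String> newProjects(Set<String> alreadySent, List<String> newDump) {
        Set<String> seen = new LinkedHashSet<>(alreadySent);
        List<String> delta = new ArrayList<>();
        for (String id : newDump) {
            // add() returns true only when the id was not known before
            if (seen.add(id)) {
                delta.add(id);
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        // Ids already deposited (assumed sample values)
        Set<String> sent = new LinkedHashSet<>(List.of("p1", "p2"));

        // First run: only p3 is new
        List<String> firstDelta = newProjects(sent, List.of("p1", "p2", "p3"));
        sent.addAll(firstDelta); // append the delta to the sent list

        // Second run: p3 is now known, so only p4 is returned
        List<String> secondDelta = newProjects(sent, List.of("p1", "p3", "p4"));
        sent.addAll(secondDelta);

        System.out.println(firstDelta);
        System.out.println(secondDelta);
    }
}
```

Because the delta is appended to the sent list after every run, each run is compared against everything deposited so far, which is what makes the output incremental rather than a diff against a single fixed dump.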


alessia.bardi marked this conversation as resolved

miriam.baglioni added 93 commits 2022-06-06 11:54:57 +02:00

79336d46c5 [Clean Context] first naive implementation of a functionality to clean not wanted contextes from one result. This implementation simply verifies the main title of the results start with a given string

c442c91f89 computing stats in each step

4190c9f6bc [graph raw] avoid NPEs importing datasource consent fields

91e32f12ed Merge branch 'master' into beta

4eff7856f5 Merge pull request '[stats-wf] computing stats in each step' (#210 ) from antonis.lempesis/dnet-hadoop:beta into beta

Reviewed-on: #210

48b580b45c [graph enrichment] fixed country_propagation oozie workflow definition, parameter saveGraph is not needed anymore by the SparkCountryPropagationJob

73c172926a [Doiboost] fixed fundingReference extraction from the Crossref records

8e8933d41a [BulkTagging] added fix if result.dataInfo is null

c5a863132c [BulkTagging] revert it

0012e57bf9 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop

4314db55c8 migration to services: update sql queries

c96a8613f8 update SQL queries

b7cd2c6ca1 added open citations

869407c6e2 [Measures] added new measure (usagecounts) as action set. Measure added at the level of the result. Ref #7587

5feae77937 [Measures] last changes to accomodate tests

dbfbe8841a [Clean Context] changed the description in input parameters

a38f0f5ea7 mergin with branch beta

61c0266a44 Merge pull request 'Remove Context from result' (#208 ) from cleancontext into beta

Reviewed-on: #208

5295effc96 [Measures] fixed issue

b33156c2ee [Dump] remove non needed class

c304657d91 [Measures] put the logic in common, no need to change the schema

88acad76f9 Merge branch 'beta' into eosc_dimitris

d012d125d7 [EOSCTag] -

b61efd613b [Measures] addressed comments in the PR

bebb2a0560 Merge branch 'eosc_dimitris' of https://code-repo.d4science.org/D-Net/dnet-hadoop into eosc_dimitris

20de75ca64 [Measures] removed typo

a289c9eae2 Merge pull request '[Measures] added new measure (UsageCounts)' (#214 ) from eosc_dimitris into beta

Reviewed-on: #214

ccba1a3db1 [Clean Context] added logic to cleaning workflow to accomodate also context cleaning

5b7d9e741c [Clean Context] added logic to cleaning workflow to accomodate also context cleaning

29150a5d0c code formatting

9a961a0092 [Clean Context] fixed issue in param name

6dc68c48e0 [EOSCTag] -

e0915061c2 [Clean Context] fixed issue in param name

7cb7066472 [EoscTag] first "rough" implementation

aa12429f50 Modified last intersection since we lost many titles.

a82ec3aaaf code formatter

bbb77052d3 [EOSCTag] first test

54162f5c4f Merge branch 'beta' into cleancontext

19d90658fc [Clean Context] added description to parameters

911ce0780a Merge branch 'cleancontext' of https://code-repo.d4science.org/D-Net/dnet-hadoop into cleancontext

81242538e6 Merge pull request 'Oozie workflow for cleancontext' (#216 ) from cleancontext into beta

Reviewed-on: #216

Looks good. We need to extend the cleaning workflow parameters to enable the extra step only when it is needed.

87bff36d9e mergin with branch beta

27c85e901a [EOSCTag] added resources and finalized test for Jupyter Notebook tagging

dfbd2bcbea [EOSC TAG] added logic in case subject is null

88562c0930 [EOSC TAG] added test for galaxy for title and description criterias

e342ec93f0 [EOSCTag] prepared resources for test

8c22e5c30a added fix to include date array with only year or year and month

78015a5733 Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta

5ffc24d1ba EOSC Services - ongoing update

f5f532d134 EOSC Services - ongoing update

05c1ea92e9 EOSC Services - added Service-specific fields in the XML record serialization

a8c51f6f16 EOSC Services - fixed query and testing preparation

e37177e1ce mergin with branch beta

b6a7ff3a99 EOSC Services - removed fields from mapping, testing preparation

2ade69dea6 EOSC Services - minor

a21fe310e5 [EOSCTag] last test and change in the implementation to search in title and descriptio

9e12cb3c92 EOSC Services - removed field knowledgegraph; depending on the released schema module

da611cfbbd [eosc_services] resolved merge conflicts

3aeedd931a [EOSCTag] fixed issue in case description is null. Modified test resources and classes

bd1108f98b mergin with branch beta

8a72de4011 [EOSCTag] modified workflow to execute all the steps and not only the last one

5fe25cc51c Merge pull request '[eosc tag] set the eosc subjects, rough implementation' (#215 ) from eosc_tag into beta

Reviewed-on: #215

846975c886 [eosc_services] using the correct 'keyword' subject type, as declared in the dnet:subject_classification_typologies vocabulary

658450d9a3 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta

cfbbcaf7c4 commented out indi_result_org_country_collab

61b4c19e65 restored indi_result_org_country_collab, removed indi_result_org_collab

a056f59c6e [UsageCount] make it as an action set as it should be, plus changed the test to make them work as well now

89657a0b78 [UsageCount] refactoring

378020e30a [eosc_services] unit test adaptation

77bc9863e9 [openorgs] mapping parent/child relations without massaging the semantic labels

23334479bb removed yet another collab, added more orgs in monitor

5d3b4a9c25 [graph merge beta] merge datasource originalid, collectedfrom, and pid lists

ca8d26bcb4 added better filter for openCitations

22f65680b9 Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta

ba642d53ff mergin with branch beta

c25134f28d fixed typo

e4eac1d20b [EOSC TAG] added code to remove EOSC Jupyter Notebook from subjects and put EOSC as classid in the qualifier

3fc9efeab6 fixed typo, addded open citations and apcs in monitor

8160763330 fixed conflict

0dc33ea391 [openorgs] fixed parent/child query, using the correct semantic labels

4c50f35c8b update publication Date format

c1971d52c4 Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta

997c50078e [graph grouping] drop relation target path before copying from source

6442763f97 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta

d098ad0d93 [hb patch] updated map

f5207885e3 [EOSCTag] changed code to remove EOSC Jupyter Notebook and modified test to exclude galaxy + software from the tagging for Galaxy

eaf9385ae5 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta

c298c148cb [CountryPropagation] fix NPE issue

5e0b8f9b5f [CountryPropagation] refactoring

5c2949a864 Merge pull request '[stats wf] added open citations & more orgs in monitor, removed collab indicator' (#213 ) from antonis.lempesis/dnet-hadoop:beta into beta

Reviewed-on: #213

108e17644e mergin with branch beta

31d4557e8d Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop

2a77ebb431 resolving conflicts

miriam.baglioni added 5 commits 2022-06-21 18:09:41 +02:00

2d54a68cde merge branch with master

fdde309f59 refactoring

8d372f1be7 refactoring

d62ca8392b [DUMP] change in the community subworkflow workflow to remove the no more needed subworkflow in common with funders dump

e9384526c6 merge branch with master

claudio.atzori closed this pull request

2022-08-01 10:14:41 +02:00


Reference: D-Net/dnet-hadoop#212