DUMP #212

Closed
miriam.baglioni wants to merge 104 commits from dump into master

This PR changes how the products associated with a funder are retrieved, and adds the code to generate the delta of the projects since the last graph creation in production.

miriam.baglioni added the enhancement label 2022-04-13 18:16:56 +02:00
alessia.bardi was assigned by miriam.baglioni 2022-04-13 18:16:56 +02:00
claudio.atzori was assigned by miriam.baglioni 2022-04-13 18:16:56 +02:00
miriam.baglioni added 10 commits 2022-04-13 18:16:57 +02:00
alessia.bardi reviewed 2022-04-13 18:40:19 +02:00
alessia.bardi left a comment
Owner

Miriam, let's talk again about the project diff: something is still unclear to me, as you can read in one of my comments.

@ -21,3 +25,4 @@
import eu.dnetlib.dhp.schema.dump.oaf.community.Funder;
import eu.dnetlib.dhp.schema.dump.oaf.community.Project;
/**

Why the changes in this class? Isn't it the new class SparkDumpFunderresults2 that implements the new approach?

@ -0,0 +66,4 @@
});
}
private static void getNewProjectList(SparkSession spark, String inputPath, String outputPath,

Make it more explicit which is the old and which is the new set of projects.

Author
Member

In which way is it not clear?

@ -0,0 +75,4 @@
Dataset<Project> projects;
projects = Utils.readPath(spark, inputPath, Project.class);
projects

Does it mean that the output is always the diff with respect to the latest published dump?
When we talked, I understood that you plan to deposit incremental diffs every time, but I do not see how you create the proper input for that.

Author
Member

ProjectListPath contains the list of projects we have already sent to Zenodo, while projects contains the newly dumped set.
We do a left join between the two datasets, and when there is no match for a new project id, that project is one to be returned.
The set of new projects to be returned is then read again and each of its ids is appended to the projectList, so each time we consider the whole set of projects already given to Zenodo and we provide only the new ones.
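
To make the logic described above concrete, here is a minimal sketch in plain Java. It is not the code in this PR: in-memory collections stand in for the Spark datasets, and the names `ProjectDeltaSketch`, `selectNewProjects`, `nextDelta`, `sentIds` and `dumpedIds` are hypothetical, introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: plain collections replace the Spark datasets.
final class ProjectDeltaSketch {

    // Equivalent of the left join followed by "keep rows with no match":
    // return the dumped ids that are absent from the already-sent list,
    // in dump order and without duplicates.
    static List<String> selectNewProjects(Set<String> sentIds, List<String> dumpedIds) {
        Set<String> delta = new LinkedHashSet<>();
        for (String id : dumpedIds) {
            if (!sentIds.contains(id)) {
                delta.add(id);
            }
        }
        return new ArrayList<>(delta);
    }

    // Compute the delta, then append its ids to the sent list so that the next
    // run is compared against everything delivered so far.
    static List<String> nextDelta(Set<String> sentIds, List<String> dumpedIds) {
        List<String> delta = selectNewProjects(sentIds, dumpedIds);
        sentIds.addAll(delta);
        return delta;
    }
}
```

Under these assumptions, calling `nextDelta` twice with the same dump returns an empty list the second time, because the ids from the first delta are already in the sent list. That is why each output is incremental with respect to everything sent so far, not only with respect to the latest dump.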

alessia.bardi marked this conversation as resolved
miriam.baglioni added 93 commits 2022-06-06 11:54:57 +02:00
81242538e6 Merge pull request 'Oozie workflow for cleancontext' (#216) from cleancontext into beta
Reviewed-on: #216

Looks good. We need to extend the cleaning workflow parameters to enable the extra step only when it is needed.
miriam.baglioni added 5 commits 2022-06-21 18:09:41 +02:00
claudio.atzori closed this pull request 2022-08-01 10:14:41 +02:00
