DUMP #212
Closed
miriam.baglioni
wants to merge 104 commits from
dump
into master
pull from: dump
merge into: D-Net:master
Reference: D-Net/dnet-hadoop#212
No description provided.
This PR changes the way the products associated with a funder are retrieved, and also adds the code that generates the delta of the projects since the last graph creation in production.
Miriam, let's talk again about the project diff, because something is still unclear to me, as you can read in one of my comments.
@@ -21,3 +25,4 @@
import eu.dnetlib.dhp.schema.dump.oaf.community.Funder;
import eu.dnetlib.dhp.schema.dump.oaf.community.Project;
/**
Why are there changes in this class? Isn't it the new class SparkDumpFunderresults2 that implements the new approach?
@@ -0,0 +66,4 @@
});
}
private static void getNewProjectList(SparkSession spark, String inputPath, String outputPath,
Make it more explicit which is the old set of projects and which is the new one.
In which way is it not clear?
@@ -0,0 +75,4 @@
Dataset<Project> projects;
projects = Utils.readPath(spark, inputPath, Project.class);
projects
Does this mean that the output is always the diff with respect to the latest published dump?
When we talked, I understood that you plan to deposit incremental diffs every time, but I do not see how you create the proper input for that.
ProjectListPath contains the list of projects we have already sent to Zenodo, while projects contains the newly dumped set.
We do a left join between the two datasets, and when a new project id has no match, that project is one to be returned.
The set of new projects to be returned is then read again and each of its ids is appended to the projectList, so each time we consider the whole set of projects already given to Zenodo and provide only the new ones.
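To make the incremental behaviour easier to check, here is a minimal sketch of the logic described above. It is not the code in this PR: plain Java collections stand in for the Spark datasets, and the names `ProjectDelta`, `newProjects` and `recordAsSent` are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the Spark job: collections replace Dataset<Project>.
final class ProjectDelta {

    // Equivalent of the left join: keep the dumped ids that have no match
    // among the ids already sent to Zenodo. Duplicates in the dump are skipped.
    static List<String> newProjects(List<String> dumpedIds, Set<String> alreadySentIds) {
        List<String> delta = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String id : dumpedIds) {
            if (!alreadySentIds.contains(id) && seen.add(id)) {
                delta.add(id);
            }
        }
        return delta;
    }

    // Equivalent of writing the new ids in append mode to the project list,
    // so that the next run compares against everything sent so far.
    static void recordAsSent(Set<String> alreadySentIds, List<String> delta) {
        alreadySentIds.addAll(delta);
    }
}
```

With this shape, two consecutive runs return only what was added since the previous one, provided `recordAsSent` is called after each run. If the append step is skipped, every run returns the diff against the same old list, which is the situation the question above is concerned with.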
Pull request closed