dump of the OpenAIRE graph #40
Closed · miriam.baglioni wants to merge 0 commits from miriam.baglioni/dnet-hadoop:dump into master
This PR concerns two dumps of the OpenAIRE research graph.
To produce these dumps, external models (derived from the internal OpenAIRE one) have been defined.
The dhp-schemas module has been extended with new packages and classes:
To execute the dump of the products related to Research Communities and Research Infrastructures/Initiatives, the following actions are performed:
To execute the dump of the whole graph, the following actions are performed:
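Broadly, each of these dump phases boils down to a Spark job that reads the serialized internal entities, maps them onto the external dump model, and writes the result as compressed JSON, one record per line. A minimal sketch of that pattern follows; the ResultMapper helper, the choice of Publication, and the paths are illustrative assumptions, not the actual implementation:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import eu.dnetlib.dhp.schema.oaf.Publication;

SparkSession spark = SparkSession.builder().appName("DumpPublications").getOrCreate();
String inputPath = "/graph/publication";   // assumption: serialized internal entities, one JSON per line
String outputPath = "/dump/publication";   // assumption: target location of the dumped records

spark
    .read()
    .textFile(inputPath)
    .map((MapFunction<String, String>) value -> {
        ObjectMapper mapper = new ObjectMapper();
        Publication p = mapper.readValue(value, Publication.class);  // internal model
        return mapper.writeValueAsString(ResultMapper.map(p));       // hypothetical mapper to the external dump model
    }, Encoders.STRING())
    .write()
    .option("compression", "gzip")
    .text(outputPath);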
@ -0,0 +29,4 @@
Assertions.assertEquals(200, client.uploadIS(is, "COVID-19.json.gz", file.length()));
String metadata = "{\"metadata\":{\"access_right\":\"open\",\"communities\":[{\"identifier\":\"openaire-research-graph\"}],\"creators\":[{\"affiliation\":\"ISTI - CNR\",\"name\":\"Bardi, Alessia\",\"orcid\":\"0000-0002-1112-1292\"},{\"affiliation\":\"eifl\", \"name\":\"Kuchma, Iryna\"},{\"affiliation\":\"BIH\", \"name\":\"Brobov, Evgeny\"},{\"affiliation\":\"GIDIF RBM\", \"name\":\"Truccolo, Ivana\"},{\"affiliation\":\"unesp\", \"name\":\"Monteiro, Elizabete\"},{\"affiliation\":\"und\", \"name\":\"Casalegno, Carlotta\"},{\"affiliation\":\"CARL ABRC\", \"name\":\"Clary, Erin\"},{\"affiliation\":\"The University of Edimburgh\", \"name\":\"Romanowski, Andrew\"},{\"affiliation\":\"ISTI - CNR\", \"name\":\"Pavone, Gina\"},{\"affiliation\":\"ISTI - CNR\", \"name\":\"Artini, Michele\"},{\"affiliation\":\"ISTI - CNR\",\"name\":\"Atzori, Claudio\",\"orcid\":\"0000-0001-9613-6639\"},{\"affiliation\":\"University of Bielefeld\",\"name\":\"Bäcker, Amelie\",\"orcid\":\"0000-0001-6015-2063\"},{\"affiliation\":\"ISTI - CNR\",\"name\":\"Baglioni, Miriam\",\"orcid\":\"0000-0002-2273-9004\"},{\"affiliation\":\"University of Bielefeld\",\"name\":\"Czerniak, Andreas\",\"orcid\":\"0000-0003-3883-4169\"},{\"affiliation\":\"ISTI - CNR\",\"name\":\"De Bonis, Michele\"},{\"affiliation\":\"Athena Research and Innovation Centre\",\"name\":\"Dimitropoulos, Harry\"},{\"affiliation\":\"Athena Research and Innovation Centre\",\"name\":\"Foufoulas, Ioannis\"},{\"affiliation\":\"University of Warsaw\",\"name\":\"Horst, Marek\"},{\"affiliation\":\"Athena Research and Innovation Centre\",\"name\":\"Iatropoulou, Katerina\"},{\"affiliation\":\"University of Warsaw\",\"name\":\"Jacewicz, Przemyslaw\"},{\"affiliation\":\"Athena Research and Innovation Centre\",\"name\":\"Kokogiannaki, Argiro\", \"orcid\":\"0000-0002-3880-0244\"},{\"affiliation\":\"ISTI - CNR\",\"name\":\"La Bruzzo, Sandro\",\"orcid\":\"0000-0003-2855-1245\"},{\"affiliation\":\"ISTI - CNR\",\"name\":\"Lazzeri, Emma\"},{\"affiliation\":\"University of Bielefeld\",\"name\":\"Löhden, Aenne\"},{\"affiliation\":\"ISTI - CNR\",\"name\":\"Manghi, Paolo\",\"orcid\":\"0000-0001-7291-3210\"},{\"affiliation\":\"ISTI - CNR\",\"name\":\"Mannocci, Andrea\",\"orcid\":\"0000-0002-5193-7851\"},{\"affiliation\":\"Athena Research and Innovation Center\",\"name\":\"Manola, Natalia\"},{\"affiliation\":\"ISTI - CNR\",\"name\":\"Ottonello, Enrico\"},{\"affiliation\":\"University of Bielefeld\",\"name\":\"Shirrwagen, Jochen\"}],\"description\":\"\\u003cp\\u003eThis dump provides access to the metadata records of publications, research data, software and projects that may be relevant to the Corona Virus Disease (COVID-19) fight. The dump contains records of the OpenAIRE COVID-19 Gateway (https://covid-19.openaire.eu/), identified via full-text mining and inference techniques applied to the OpenAIRE Research Graph (https://explore.openaire.eu/). The Graph is one of the largest Open Access collections of metadata records and links between publications, datasets, software, projects, funders, and organizations, aggregating 12,000+ scientific data sources world-wide, among which the Covid-19 data sources Zenodo COVID-19 Community, WHO (World Health Organization), BIP! FInder for COVID-19, Protein Data Bank, Dimensions, scienceOpen, and RSNA. \\u003cp\\u003eThe dump consists of a gzip file containing one json per line. 
Each json is compliant to the schema available at https://doi.org/10.5281/zenodo.3974226\\u003c/p\\u003e \",\"title\":\"OpenAIRE Covid-19 publications, datasets, software and projects metadata.\",\"upload_type\":\"dataset\",\"version\":\"1.0\"}}";
Can we define this as a classpath test resource?
Yes, I will.
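A possible shape for that change, sketched here with an assumed resource path and location (not the actual ones):

import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;

// Illustrative sketch: load the Zenodo metadata payload from a classpath test resource
// (e.g. placed under src/test/resources) instead of embedding it as a string literal in the test.
String metadata = IOUtils.toString(
    getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/graph/dump/zenodo_metadata.json"),
    StandardCharsets.UTF_8);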
@ -0,0 +16,4 @@
private Pid pid;
private List<String> affiliation;
We shouldn't expose author affiliations; I believe they create confusion with the result-organization relationships we're already providing.
Ok, I will remove that part. There is no problem for the dump of community products, since we do not provide relations there. I think we could leave it for communities and remove it for the dump of the whole graph.
As for the complete graph, it should be removed, yes.
For the communities, instead, it is extra information that could potentially be useful for dump consumers, but I still have the feeling we didn't analyse it enough to have a clear picture of its usefulness.
We should start with something simple:
So, IMO we can keep it, but until we have basic statistics proving its usefulness, I doubt we should advertise it.
Ok then, I will remove it for RC/RI as well. If needed, we can always add it back later.
@ -0,0 +3,4 @@
import java.io.Serializable;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
Can we have a separate factory class to build a ControlledField from a StructuredProperty? One model definition should never directly depend on the other; the mapping should express that dependency instead.
I will change the implementation of the method that introduces the dependency.
Actually, there was no need: the import was just a leftover. I must have already done it and simply did not remember.
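For what it's worth, such a factory could look roughly like the sketch below, assuming the public ControlledField exposes scheme/value setters (class and method names here are illustrative, not the agreed implementation):

import eu.dnetlib.dhp.schema.oaf.StructuredProperty;

// Illustrative sketch: express the mapping between the two models in a dedicated factory,
// so that the public ControlledField definition never depends on the internal StructuredProperty.
public class ControlledFieldFactory {

    private ControlledFieldFactory() {
    }

    public static ControlledField fromStructuredProperty(final StructuredProperty sp) {
        final ControlledField cf = new ControlledField();
        cf.setScheme(sp.getQualifier() != null ? sp.getQualifier().getClassid() : null);
        cf.setValue(sp.getValue());
        return cf;
    }
}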
@ -0,0 +7,4 @@
import eu.dnetlib.dhp.schema.oaf.ExtraInfo;
// ExtraInfo renamed to ExternalReference; do not confuse with the ExternalReference in the oaf schema
public class ExternalReference implements Serializable {
I see two major issues in this model class: the name ExternalReference clearly clashes with the internal model type, and it includes a value field containing the citation XML representation. I am sure we do not want to expose them in our official dumps in such mixed encodings.
I do not remember why we chose to dump the ExtraInfo of the internal model and name it ExternalReference in the public model. I do agree with you: we do not want to expose mixed encodings. But we must decide what we want to deliver. Do we want to provide the ExternalReference as it is in our internal model, or do we want to provide citation information? In the first case we need to redefine the mapping. In the second case, we need to decide how to parse the XML containing the citation and which values we want to provide.
The general idea is to expose citations as relationships between objects found in the graph, so at the moment this would imply implementing a mapping from the current XML-based encoding to a target representation based on the common dump Relation model. Personally, I would avoid slowing down the integration of this PR any further and would postpone the implementation of such a mapping to a (near-)future enhancement of the dump procedure. We still need to discuss whether we also want to expose the citation texts describing objects that are not found in the graph and, if so, how to encode them in the dump.
Ok so, for the moment ExternalReferences will be removed from the dump data model:
@ -0,0 +22,4 @@
// ( article | book ) processing charges. Defined here to cope with possible wrongly typed
// results
// private Field<String> processingchargeamount;
Commented code lines/blocks should be removed.
ack
@ -0,0 +170,4 @@
this.subjects = subjects;
}
// public String getPolicies() {
Commented code lines/blocks should be removed.
ack
@ -0,0 +16,4 @@
private String jurisdiction;
// public String getId() {
Commented code lines/blocks should be removed.
ack
@ -0,0 +76,4 @@
this.pid = pid;
}
// public List<KeyValue> getCollectedfrom() {
Commented code lines/blocks should be removed.
ack
@ -90,0 +91,4 @@
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>okhttp</artifactId>
<version>4.7.2</version>
Submodules should never declare the version of an external library; please move the dependency version to the main pom, where all the dependencies are declared along with their versions.
ack
@ -41,6 +41,43 @@
</build>
<dependencies>
<!-- <dependency>-->
Please remove unnecessary dependencies
In the same pom there are also versions associated with external libraries. I will move them to the main pom, as per the comment above.
@ -0,0 +1,105 @@
/**
Empty Javadoc declaration? :) Please provide a concise one in the correct place, right before the class name declaration :)
ack
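For illustration only (the class name and description below are invented), a concise class-level Javadoc placed right before the class declaration looks like this:

import java.io.Serializable;

// Illustrative only: class name and description are invented for this example.
/**
 * Represents one entry of the public dump model; a short one- or two-sentence
 * description placed right above the class declaration is enough.
 */
public class ExampleDumpModel implements Serializable {
    // fields and accessors omitted
}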
@ -0,0 +52,4 @@
Element root = doc.getRootElement();
map.put(root.attribute("id").getValue(), root.attribute("label").getValue());
} catch (DocumentException e) {
e.printStackTrace();
Please allow the exception to propagate to the caller; printing it is not that helpful.
Done.
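One way the change could look, sketched under the assumption that the parsing lives in a small helper method (the method name and signature are illustrative):

import java.io.StringReader;
import java.util.Map;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

// Illustrative sketch: declare the checked exception instead of swallowing it with
// printStackTrace(), so the caller (or the workflow runner) decides how to handle the failure.
private static void addContextEntry(final Map<String, String> map, final String xml) throws DocumentException {
    final Document doc = new SAXReader().read(new StringReader(xml));
    final Element root = doc.getRootElement();
    map.put(root.attribute("id").getValue(), root.attribute("label").getValue());
}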
@ -0,0 +1,84 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,83 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,62 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,187 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,48 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +4,4 @@
import java.io.Serializable;
public class Constants implements Serializable {
// collectedFrom va con isProvidedBy -> becco da ModelSupport
I guess this comment is a leftover; either translate it into English or delete it :)
It was a leftover, as was the one directly following it :). Removed.
@ -0,0 +1,84 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,105 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,125 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +71,4 @@
cce.execute(Process::getRelation, CONTEX_RELATION_DATASOURCE, ModelSupport.getIdPrefix(Datasource.class));
log.info("Creating relations for projects... ");
// cce
Commented code lines/blocks should be removed.
These lines were left on purpose: we should remember to also add the generation of relations between contexts and projects once the coverage of projects carrying their OpenAIRE id in the community profiles is higher. I would be in favour of un-commenting it and getting the relations we can already have.
Fine, this is the kind of comment that is well suited to sit near a commented code block.
@ -0,0 +1,502 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +436,4 @@
return f;
} catch (DocumentException e) {
e.printStackTrace();
Let the exception propagate to the caller; printing it is not that useful.
ack
@ -0,0 +1,197 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,98 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,88 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,51 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,110 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,56 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
@ -0,0 +1,157 @@
/**
Consider moving the Javadoc right before the class name declaration.
ack
A general remark concerns the lack of javadoc in the dump model classes. As I know you already have lots of descriptions for them in the Google document, please consider moving them into proper javadoc definitions.
Except for a few issues in the model definition that should be discussed further, the majority of the comments are about minor changes.
Thanks for this HUGE contribution :)
@ -37,3 +37,3 @@
<arg>--hdfsNameNode</arg><arg>${nameNode}</arg>
<arg>--fileURL</arg><arg>${projectFileURL}</arg>
<arg>--hdfsPath</arg><arg>${workingDir}/projects</arg>
<arg>--hdfsPath</arg><arg>${workingDir}/project</arg>
Was this workflow changed on purpose? I assume the project enrichment workflow has nothing to do with the dump procedures. If this change was not intentional, please revert it.
The change was not intentional and I will revert it. In any case, it will not affect the execution of the process, since it is a workingDir directory used to store information and it is referenced in the same way everywhere in the workflow.
PR manually integrated by 5b994d7ccf
Pull request closed