enrichmentSingleStep #373

Merged
claudio.atzori merged 19 commits from enrichmentSingleStep into beta 2024-01-10 16:58:50 +01:00

This PR avoids materializing one graph for each of the propagation steps. It allows resuming the execution of the enrichment from any step in the chain.

Properties to be provided:

sourcePath= the source graph
resumeFrom=default (it starts from OrcidPropagation). Other accepted options are:

  • BulkTagging : starts from the bulk tagging step and executes all the other steps
  • AffiliationInstitutionalRepository : starts from the propagation of affiliation for results collected from institutional repositories
  • AffiliationSemanticRelation : starts from the propagation of affiliation relations (and project participation) exploiting the Parent-Child relation
  • CommunityOrganization : starts from the propagation of the community tag to results belonging to organizations relevant for the communities
  • ResultProject : starts from the step propagating the relevance for a project across results linked by the given semantic relations
  • CommunityProject : starts from the propagation of the community tag to results linked to projects relevant for the community
  • CommunitySemanticRelation : starts from the propagation of the community tag for results relevant to communities and linked to other results via the given semantic relations
  • CountryPropagation : starts from the country propagation step
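The resume behaviour described above can be sketched as follows. This is only an illustration, not the workflow code of this PR: it assumes the chain runs in the order in which the options are listed, and the names `STEPS` and `steps_to_run` are hypothetical.

```python
# Hypothetical sketch: the step order is ASSUMED to match the order
# in which the resumeFrom options are listed in the description.
STEPS = [
    "OrcidPropagation",  # first step, reached with resumeFrom=default
    "BulkTagging",
    "AffiliationInstitutionalRepository",
    "AffiliationSemanticRelation",
    "CommunityOrganization",
    "ResultProject",
    "CommunityProject",
    "CommunitySemanticRelation",
    "CountryPropagation",
]


def steps_to_run(resume_from="default"):
    """Return the steps executed for a given resumeFrom value."""
    if resume_from == "default":
        return list(STEPS)
    # Only the options listed in the description are accepted here.
    if resume_from not in STEPS[1:]:
        raise ValueError(f"unsupported resumeFrom value: {resume_from}")
    return STEPS[STEPS.index(resume_from):]


if __name__ == "__main__":
    print(steps_to_run("CommunityProject"))
```

Under these assumptions, resuming from CommunityProject would run CommunityProject, CommunitySemanticRelation and CountryPropagation only.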
allowedsemrelsorcidprop= the semantic relations to be used in the orcid propagation step (isSupplementedBy;isSupplementTo)
allowedsemrelsresultproject= the semantic relations to be used in the ResultProject propagation step (isSupplementedBy;isSupplementTo)
allowedsemrelscommunitysemrel= the semantic relations to be used in the CommunitySemanticRelation step (isSupplementedBy;isSupplementTo)
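The three allowedsemrels properties above carry several values in a single string. A minimal sketch of how such a value could be read, assuming `;` is the separator (as the example values suggest); `split_values` and `is_allowed` are hypothetical helpers, not functions from this PR:

```python
def split_values(raw):
    """Split a ';'-separated property value, dropping empty entries."""
    return [item.strip() for item in raw.split(";") if item.strip()]


def is_allowed(relation, allowed):
    """Check whether a semantic relation is among the allowed ones."""
    return relation in set(allowed)


if __name__ == "__main__":
    # Example value taken from the property description above.
    allowed = split_values("isSupplementedBy;isSupplementTo")
    print(allowed)
    print(is_allowed("isSupplementTo", allowed))
```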
datasourceWhitelistForCountryPropagation= the whitelist of datasources to be used for country propagation. It should list the OpenAIRE identifiers of all the datasources that should be used in country propagation although their jurisdiction differs from those given in the allowedtypes parameter (10|opendoar____::16e6a3326dd7d868cbc926602a61e4d0;10|openaire____::fdb035c8b3e0540a8d9a561a6c44f4de;10|eurocrisdris::fe4903425d9040f680d8610d9079ea14;10|openaire____::5b76240cc27a58c6f7ceef7d8c36660e;10|openaire____::172bbccecf8fca44ab6a6653e84cb92a;10|openaire____::149c6590f8a06b46314eed77bfca693f;10|eurocrisdris::a6026877c1a174d60f81fd71f62df1c1;10|openaire____::4692342f0992d91f9e705c26959f09e0;10|openaire____::8d529dbb05ec0284662b391789e8ae2a;10|openaire____::345c9d171ef3c5d706d08041d506428c;10|opendoar____::1c1d4df596d01da60385f0bb17a4a9e0;10|opendoar____::7a614fd06c325499f1680b9896beedeb;10|opendoar____::1ee3dfcd8a0645a25a35977997223d22;10|opendoar____::d296c101daa88a51f6ca8cfc1ac79b50;10|opendoar____::798ed7d4ee7138d49b8828958048130a;10|openaire____::c9d2209ecc4d45ba7b4ca7597acb88a2;10|eurocrisdris::c49e0fe4b9ba7b7fab717d1f0f0a674d;10|eurocrisdris::9ae43d14471c4b33661fedda6f06b539;10|eurocrisdris::432ca599953ff50cd4eeffe22faf3e48). Note: this list has not been checked against the master datasources obtained from datasource deduplication. This should be fixed in the code implementing this step.
allowedtypes= parameter that lists the types of datasources to be used in country propagation. It is based on the jurisdiction attribute (Institutional)
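Read together, the whitelist and allowedtypes descriptions suggest that a datasource takes part in country propagation when its jurisdiction is an allowed type or its identifier is whitelisted. A minimal sketch of that reading, not the step's actual code; the function name and its arguments are hypothetical, and the whitelist is shortened to two of the identifiers listed above:

```python
# Values taken from the property descriptions; the whitelist is truncated
# to two entries for brevity.
ALLOWED_TYPES = {"Institutional"}
WHITELIST = {
    "10|opendoar____::16e6a3326dd7d868cbc926602a61e4d0",
    "10|openaire____::fdb035c8b3e0540a8d9a561a6c44f4de",
}


def used_for_country_propagation(datasource_id, jurisdiction,
                                 allowed_types=ALLOWED_TYPES,
                                 whitelist=WHITELIST):
    """Assumed rule: allowed jurisdiction type OR whitelisted identifier."""
    return jurisdiction in allowed_types or datasource_id in whitelist


if __name__ == "__main__":
    # "10|example_____::0000" is a made-up identifier for illustration.
    print(used_for_country_propagation("10|example_____::0000", "Institutional"))
    print(used_for_country_propagation("10|example_____::0000", "National"))
```

The whitelist therefore only needs to hold datasources whose jurisdiction is not already covered by allowedtypes.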
outputPath=/tmp/miriam/enrichment_one_step
pathMap = parameter to set the JSONPath expressions to be used in bulk tagging when verifying the constraints ({"author":"$['author'][*]['fullname']", \
  "title":"$['title'][*]['value']",\
  "orcid":"$['author'][*]['pid'][*][?(@['qualifier']['classid']=='orcid')]['value']" ,\
  "orcid_pending":"$['author'][*]['pid'][*][?(@['qualifier']['classid']=='orcid_pending')]['value']" ,\
  "contributor" : "$['contributor'][*]['value']",\
  "description" : "$['description'][*]['value']",\
  "subject" :"$['subject'][*]['value']" , \
  "fos" : "$['subject'][?(@['qualifier']['classid']=='FOS')].value" ,\
  "sdg" : "$['subject'][?(@['qualifier']['classid']=='SDG')].value",\
  "journal":"$['journal'].name",\
  "hostedby":"$['instance'][*]['hostedby']['key']",\
  "collectedfrom":"$['instance'][*]['collectedfrom']['key']",\
  "publisher":"$['publisher'].value",\
  "publicationyear":"$['dateofacceptance'].value"})
blacklist= the blacklist to be used for the AffiliationSemanticRelation step (empty)
allowedpids= list of pids allowed for the orcid propagation step (orcid;orcid_pending)
baseURL = the baseURL for the community APIs (https://beta.services.openaire.eu/openaire/community/)
iterations= number of iterations to be used in the AffiliationSemanticRelation step (1)
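The pathMap entries listed above are JSONPath expressions evaluated against a result record during bulk tagging. As a rough illustration of what the "title" and "orcid" expressions appear intended to select, here is a plain-Python equivalent that does not use a JSONPath library. The record shape and the function names are assumptions made for this sketch, and the filter semantics of the JSONPath engine actually used may differ in edge cases.

```python
def title_values(record):
    """Rough equivalent of $['title'][*]['value']."""
    return [t["value"] for t in record.get("title", []) if "value" in t]


def pid_values(record, classid):
    """Rough equivalent of
    $['author'][*]['pid'][*][?(@['qualifier']['classid']==<classid>)]['value'].
    """
    return [
        pid["value"]
        for author in record.get("author", [])
        for pid in author.get("pid", [])
        if pid.get("qualifier", {}).get("classid") == classid and "value" in pid
    ]


if __name__ == "__main__":
    # Made-up record, shaped after the paths used in pathMap.
    record = {
        "title": [{"value": "A sample title"}],
        "author": [
            {
                "fullname": "Jane Doe",
                "pid": [
                    {"value": "0000-0000-0000-0001",
                     "qualifier": {"classid": "orcid"}},
                    {"value": "0000-0000-0000-0002",
                     "qualifier": {"classid": "orcid_pending"}},
                ],
            }
        ],
    }
    print(title_values(record))
    print(pid_values(record, "orcid"))
```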
claudio.atzori was assigned by miriam.baglioni 2024-01-10 11:31:44 +01:00
miriam.baglioni added 19 commits 2024-01-10 11:31:45 +01:00
claudio.atzori merged commit 16d858fbf0 into beta 2024-01-10 16:58:50 +01:00
Reference: D-Net/dnet-hadoop#373