beta to master May 2024 #437

Manually merged
claudio.atzori merged 0 commits from beta into beta_to_master_may2024 2024-05-21 14:51:35 +02:00

This PR promotes to the master branch the set of changes tested in the context of the May 2024 content update cycle (#9621).

The graph internal schema module introduces the following changes since v4.17.2, the version used in PROD:

  • 4.17.3 [Graph model] moved context to the level of the entity.
  • 5.17.3 [Graph model] added a result-level textual field to store the transformative agreement information.
  • 6.1.0 [Solr model] introduced model classes to provide a JSON representation of records embedding information from the related entities.
  • 6.1.1 [Model constants] added some MAG constants used in the context of the DOIBoost decommissioning.
  • 6.1.2 [Solr model] added the field project.oamandatepublications to the JSON Solr records.

Content acquisition

  • #403 - mapped oaf:country from results
  • #401 - Revised Open Citation integration procedure
  • #400 - new plugin to collect from a dump of BASE
  • #397 - FOS ActionSet for the classification of results without a DOI
  • #395 - Revised procedure when converting json data into xml
  • #394 - Orcid Update Procedure
  • #418 - The monolithic DOIBoost datasource is now aggregated as separate datasources, each with its own dedicated aggregation workflow
  • #432 - Retry mechanism for the REST metadata collector plugin
  • #435 - NEW!!! Modification of Microsoft Academic Graph Mapping - we've opted to incorporate only MAG items with DOIs into the graph and designate them as hidden (invisible = true)

Changes to the graph pipeline

  • #407 - adding context information to projects and datasources
  • #404 - refactoring the Oaf records merge utilities into dhp-common
  • #398 - Enrich authors with ORCID info using new matching algorithm
  • #393 - Revised instance type comparisons in dedup phase
  • #422 - Refactoring the Oaf records merge utilities into dhp-common
  • #429 - Miscellaneous related to changes in MergeUtils

Graph provision

  • #399 - Solr JSON payload
  • #434 - Various fixes in the graph provisioning workflow
  • #407 - correctly selecting the active hdfs node for the impala cluster
  • #372 - Changes to the indicators and funders definitions; added the workflows dhp-stats-monitor-irish, for updating the Irish monitor stats DBs, and dhp-stats-hist-snaps, for creating and updating a DB of historical snapshots from the graph.
  • #408 - added missing EOS; generate tables with Parquet files instead of CSV in the contexts.sh script
  • #411 - fixed typo in indicator query, added more institutions
  • #416 - fixed the result_country definition and updated the stats DB copy procedure
  • #421 - Improvements to copying data from ocean to impala
  • #424 - Indicator fixes
  • #430 - Various fixes in the stats wf
claudio.atzori added 232 commits 2024-05-21 14:10:56 +02:00
29194472a7 Promote "Research" to a jolly instanceType in dedup comparisons
Compare Part of book or chapter of book with Article
d65285da7f Promote "Research" to a jolly instanceType in dedup comparisons
Compare "Journal" and "Part of book or chapter of book" with "Article"
773e856550 Revised procedure when converting json data into xml:
- json object keys are renamed to be conformant to xml tag elements, special characters are substituted or removed
- json string values are no longer post-processed as they are already escaped by the org.json.XML.toString method
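The key-renaming step described in this commit can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's code: the function names `to_xml_tag` and `rename_keys` are hypothetical, and the exact substitution rules (here: replace anything outside a conservative ASCII subset with `_`) are an assumption, since the commit only says that special characters are "substituted or removed". String values are deliberately left untouched, mirroring the note that escaping is delegated to the XML serializer.

```python
import re


def to_xml_tag(key: str) -> str:
    """Turn an arbitrary JSON key into a string usable as an XML element name.

    Assumed rule set (not taken from the commit): keep ASCII letters, digits,
    "_", "." and "-"; replace everything else with "_".
    """
    tag = re.sub(r"[^A-Za-z0-9_.-]", "_", key)
    # An element name may not be empty or start with a digit, "." or "-".
    if not re.match(r"[A-Za-z_]", tag):
        tag = "_" + tag
    return tag


def rename_keys(value):
    """Recursively rename dict keys; leave string values as they are."""
    if isinstance(value, dict):
        return {to_xml_tag(k): rename_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_keys(v) for v in value]
    return value
```

Note that two distinct keys can collapse to the same tag under a rule like this (for example `a b` and `a:b`); a real implementation would need to decide how to handle that.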
915a76a796 following the comment on the pull requests:
- Added #NUM_OF_THREADS "complete" jobs to the queue at the end of the main loop to avoid deadlock
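The fix above reads like the usual sentinel ("poison pill") pattern: once the main loop has queued all real work, one terminating job per worker thread is queued so that no worker stays blocked on an empty queue. The sketch below shows that pattern in Python under this assumption. The names `COMPLETE`, `worker` and `run`, and the doubling "work", are purely illustrative and do not come from the commit.

```python
import queue
import threading

NUM_OF_THREADS = 4
COMPLETE = object()  # sentinel job meaning "no more work for this worker"


def worker(jobs: queue.Queue, results: queue.Queue) -> None:
    while True:
        job = jobs.get()  # blocks until a job is available
        if job is COMPLETE:
            break  # each worker consumes exactly one sentinel and exits
        results.put(job * 2)  # placeholder for the real processing


def run(items):
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(NUM_OF_THREADS)
    ]
    for t in threads:
        t.start()
    for item in items:  # main loop: enqueue the real jobs
        jobs.put(item)
    # One sentinel per thread; without these, join() below would wait
    # forever on workers blocked in jobs.get().
    for _ in range(NUM_OF_THREADS):
        jobs.put(COMPLETE)
    for t in threads:
        t.join()
    return sorted(results.get() for _ in range(len(items)))
```

Queuing fewer sentinels than there are workers would leave some threads waiting indefinitely, which is the deadlock the commit message refers to.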
ad0e9aa80c added first part of refactoring of the code generating MAG,
make it more readable using spark sql queries
c532831718 Moved Crossref Mapping to dhp-aggregations,
refactored the code to build the parts of the Oaf with the utilities in OafMappingUtils instead of those defined in DOIBoostMappingUtils
98dc042db5 mapping generated for MAG,
missing generation of Organization Action set
43b454399f - Bug fix in matchOrderedTokenAndAbbreviations algorithms where tokens with same initial character were always considered equal
- AuthorsMatch exploits the new matching strategy used for ORCID enhancements in #PR398: split author names in tokens, order the tokens, then check for matches of ordered full tokens or abbreviations
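The matching strategy summarised in this commit (split names into tokens, order them, compare position by position as full tokens or abbreviations) can be sketched as follows. This is a simplified Python illustration, not the project's `AuthorsMatch` implementation: the function names are hypothetical, and treating only a single-letter initial as an abbreviation is an assumption. The second branch of `token_matches` shows the point of the bug fix: sharing an initial is not enough when both tokens are full names.

```python
import re


def tokens(name: str) -> list:
    """Lower-case the name, split it into word tokens and sort them."""
    return sorted(re.findall(r"\w+", name.lower()))


def token_matches(a: str, b: str) -> bool:
    """Full tokens must be identical; a one-letter token matches as an initial."""
    if a == b:
        return True
    shorter, longer = sorted((a, b), key=len)
    # Only a single-letter token counts as an abbreviation (assumed rule).
    # Two full tokens with the same first letter, e.g. "john" and "jane",
    # do NOT match.
    return len(shorter) == 1 and longer.startswith(shorter)


def authors_match(name1: str, name2: str) -> bool:
    t1, t2 = tokens(name1), tokens(name2)
    return len(t1) == len(t2) and all(
        token_matches(a, b) for a, b in zip(t1, t2)
    )
```

A positional comparison of sorted tokens is a simplification: abbreviating a token can change its sort position (for example "Mark Ma" versus "M. Ma"), so a production matcher would need a more careful alignment.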
49af2e5740 Miscellaneous updates to the copying operation to Impala Cluster:
- Update the algorithm for creating views that depend on other views; overcome some bash-instabilities.
- Upon any error, fail the whole process, not just the current DB-creation, as those errors usually indicate a bug in the initial DB-creation, that should be fixed immediately.
- Enhance parallel-copy of large files by "hadoop distcp" command.
- Reduce the "invalidate metadata" commands to just the current DB's tables, in order to eliminate the general overhead on Impala.
- Show the number of tables and views in the logs.
- Fix some log-messages.
1878199dae Miscellaneous fixes:
- in Merge By ID pick by preference those records coming from delegated Authorities
- fix various tests
- close spark session in SparkCreateSimRels
claudio.atzori manually merged commit c3fe59bc78 into beta_to_master_may2024 2024-05-21 14:51:35 +02:00
Reference: D-Net/dnet-hadoop#437