Master branch updates from beta September 2023 #337

Manually merged
claudio.atzori merged 1271 commits from beta into master 2023-09-06 11:31:09 +02:00

This PR brings to the master branch the changes available in the beta branch as of September 2023.

  • #284 impact indicators workflow #8172 (https://support.openaire.eu/issues/8172)
  • #319 Import dnet-pace-core module in this project and use it after renaming to dhp-pace-core
  • #320 Import affiliation relations from Crossref and related fix #335
  • #321, #322 [stats wf] Changes for promotion of production DBs to the new cluster
  • #323 fixed various unit tests
  • #325 graph cleaning, suggestions from ticket #8898 (https://support.openaire.eu/issues/8898)
  • #326 [graph indexing] expand the instance level fulltext in the XML records
  • #328 Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities
  • #324 [dedup wf] Refactor Dedup using Spark Dataframe API, initial support for scala 2.12 and Spark 3.4
  • #329 [dedup wf] DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag
  • #330 [dedup wf] Rewrite SparkPropagateRelation exploiting Dataframe API
  • #331 [dedup wf] Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set, it defaults to 1 GB
  • #336 [graph raw] datainfo.invisible set as true only for entities
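
The "CleanRelation" step from #328 can be illustrated with a minimal sketch in plain Java. It is not the project's Spark implementation: the `Rel` class, its fields, and the `clean` method are hypothetical names introduced only for this example, and the relations are held in memory rather than in a Spark dataset. The sketch keeps a relation only if it is not flagged as deleted by inference and both of its endpoints exist among the known entity identifiers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified relation model (assumption: not the project's schema class).
class Rel {
    final String source;
    final String target;
    final boolean deletedByInference;

    Rel(String source, String target, boolean deletedByInference) {
        this.source = source;
        this.target = target;
        this.deletedByInference = deletedByInference;
    }
}

class CleanRelationSketch {

    // Keeps only relations that are not deleted by inference and whose
    // source and target are both present in the set of known entity ids.
    static List<Rel> clean(List<Rel> relations, Set<String> entityIds) {
        List<Rel> kept = new ArrayList<>();
        for (Rel r : relations) {
            boolean dangling = !entityIds.contains(r.source) || !entityIds.contains(r.target);
            if (!r.deletedByInference && !dangling) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> ids = new HashSet<>(Arrays.asList("a", "b"));
        List<Rel> rels = Arrays.asList(
            new Rel("a", "b", false),  // valid relation, kept
            new Rel("a", "b", true),   // deleted by inference, dropped
            new Rel("a", "c", false)); // target "c" is unknown (dangling), dropped
        System.out.println(clean(rels, ids).size());
    }
}
```

On a large graph the same check would more plausibly be expressed as joins between the relation and entity-identifier datasets than as a loop over an in-memory set.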
giambattista.bloisi was assigned by claudio.atzori 2023-09-04 16:40:25 +02:00
miriam.baglioni was assigned by claudio.atzori 2023-09-04 16:40:25 +02:00
claudio.atzori added 1271 commits 2023-09-04 16:40:26 +02:00
793b5a8e5f Update 'dhp-workflows/dhp-graph-mapper/src/main/java/eu/dnetlib/dhp/oa/graph/dump/ResultMapper.java'
Removing the dump of Measure at the level of the result. We decided not to map it
35e20b0647 updated resolution wf:
- generate a new version of the graph
- changed merge from union to join
7af0bbd0b1 [scala-refactor] Module dhp-aggregation:
Moved all scala source into src/main/scala and src/test/scala
81bf604059 [scala-refactor] Module dhp-common:
Moved all scala source into src/main/scala and src/test/scala
bf880e2508 [scala-refactor] Module dhp-graph-mapper:
Moved all scala source into src/main/scala and src/test/scala
e9f285ec4d [scala-refactor] Module dhp-doiboost:
Moved all scala source into src/main/scala and src/test/scala
e5bff64f2e [scholexplorer]
- Minor fix on SparkConvertRDDtoDataset
-first implementation of retrieve datacite dump
63952018c0 [scholexplorer]
-moved SparkRetrieveDataciteDelta in scala folder
b881ee5ef8 [scholexplorer]
- implemented generation of scholix of delta update of datacite
0163dadb7f [doiboost]
- update MAG schema, new field added in version dec-2021
d580e15442 Modified last intersection since we lost many titles.
this is my last resource, after that, I've to  change my job
81242538e6 Merge pull request 'Oozie workflow for cleancontext' (#216) from cleancontext into beta
Reviewed-on: #216

Looks good. We need to extend the cleaning workflow parameters to enable the extra step only when it is needed.
2a4bf32d4c Merge branch 'hive' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into hive
# Conflicts:
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step10.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step13.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step14.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_1-definitions.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step7.sql
686580a220 - New Monitor DB workflow
- New Organization added
973d78a4d6 Update step15_5.sql
Added unpaywalls open access colors
98c34263ed Update step20-createMonitorDB.sql
Add University of Cape Town organization
43b23a9bf3 Update step20-createMonitorDB.sql
Added Technological University Dublin
9e1335df4c -Added Technological University Dublin
-Added project_organization_contribution table
9b41dff33c Update step20-createMonitorDB.sql
Added Delft University of Technology
c85de8fa1f -Added Technological University Dublin
-Added project_organization_contribution table
-Add   Delft University of Technology
e57ecdaf98 Update step20-createMonitorDB.sql
Add University of Manitoba
00d0d162b6 Update copyDataToImpalaCluster.sh
Added a temporary folder to copy the files to avoid permission issues
b3f9633205 Update copyDataToImpalaCluster.sh
Added option --user to impala-shell command
86f4f63daf Updates to steps related to transfer data to impala cluster
1. Remove external table definitions in stats_ext
2. Fix the issue where some views are not created.
3. Added two workflow parameters for copying also the usage stats dbs
d6102dd576 Update step16-createIndicatorsTables.sql
- Add org names to indi_project_collab_org
- Add indi_pub_bronze_oa
- Changes to indi_pub_hybrid_oa_with_cc
ebe586b1d1 Impact indicators/Unpaywall
- Added Impact indicators
- Added unpaywall open access colours
2324670714 Split Monitor DBs-Interdisciplinarity indicators
- Split DBs Monitor for faster rendering of visualizations
- Add interdisciplinarity indicators from result_fos
fa24e2e18f Bug fix on indicators step
indi_pub_gold_oa table was missing during the creation of other indicators
2032b0df40 Bug fixes
1. Remove tables/views from old databases in the new cluster, before dropping the dbs
2. Fix id in result_accessroute, indi_impact_measures, indi_pub_bronze_oa
afcad08396 Update step20-createMonitorDB_institutions.sql
Added openorgs____::c0b262bd6eab819e4c994914f9c010e2   -- National Institute of Geophysics and Volcanology
4c770a5e29 Update finalizeImpalaCluster.sh
Drop views in shadow dbs before dropping the db
b0ade43608 Precompile blacklists patterns before evaluating clustering criteria
Enable Junit 5 tests in maven builds
Make path comparisons platform-independent
Read String resource files assuming they are encoded in UTF-8
Fix a few test conditions
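
As an illustration of the first item in this commit, the sketch below compiles blacklist regular expressions once, up front, instead of on every evaluation. It is a minimal example under stated assumptions: the class name `BlacklistSketch`, the method `isBlacklisted`, and the choice of full-string matching via `Matcher.matches()` are hypothetical and not taken from the project code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (assumption: not the project's class) that holds
// blacklist patterns compiled once at construction time.
class BlacklistSketch {

    private final List<Pattern> patterns = new ArrayList<>();

    BlacklistSketch(List<String> regexes) {
        // Compile each expression a single time, rather than per evaluated value.
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
    }

    // Returns true if the whole value matches any precompiled pattern.
    boolean isBlacklisted(String value) {
        for (Pattern p : patterns) {
            if (p.matcher(value).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        BlacklistSketch blacklist = new BlacklistSketch(Arrays.asList("^test.*", "n/?a"));
        System.out.println(blacklist.isBlacklisted("test title"));
        System.out.println(blacklist.isBlacklisted("A real title"));
    }
}
```

The saving comes from avoiding repeated `Pattern.compile` calls, which matter when the same blacklist is checked against many records.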
36e0a8fec4 Changes to Promotion Stats WF
1. Add new cluster host at impala-shell commands
2. Add a step for splitting monitor dbs
3. Update workflow.xml to include the new splitting monitor dbs step
be2caedb04 Update step20-createMonitorDB_institutions.sql
Add openorgs____::1624ff7c01bb641b91f4518539a0c28a Vrije Universiteit Amsterdam
4648cd88d4 Update step15.sql
Cast score to double
a475cfcb7b Update step16-createIndicatorsTables.sql
Rename a field in indi_pub_interdisciplinarity
74cb060bfe Update step15_5.sql
Add "if not exists" clause
76901a25f9 Updates Promotion DBs
- Add a step for promoting the split monitor DBs
bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
e64c2854a3 Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
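
The deterministic sorting mentioned in the last line of this commit can be sketched as a comparator with a tie-breaker. This is an assumption-labelled illustration, not the project's code: the `Doc` class and its `orderField` and `id` members are hypothetical stand-ins for the ordering field and the identity field. Sorting on the ordering field alone leaves the relative order of ties dependent on input order; adding the identity field as a second key makes the result independent of it, provided identifiers are unique.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical clustered element (assumption: not the project's class).
class Doc {
    private final String orderField;
    private final String id;

    Doc(String orderField, String id) {
        this.orderField = orderField;
        this.id = id;
    }

    String getOrderField() { return orderField; }
    String getId() { return id; }
}

class DeterministicSortSketch {

    // Order by the ordering field first, then break ties on the identity field.
    static final Comparator<Doc> BY_ORDER_THEN_ID =
        Comparator.comparing(Doc::getOrderField).thenComparing(Doc::getId);

    // Returns a sorted copy, leaving the input list untouched.
    static List<Doc> sorted(List<Doc> docs) {
        List<Doc> copy = new ArrayList<>(docs);
        copy.sort(BY_ORDER_THEN_ID);
        return copy;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(new Doc("x", "2"), new Doc("x", "1"), new Doc("a", "3"));
        for (Doc d : sorted(docs)) {
            System.out.println(d.getOrderField() + " " + d.getId());
        }
    }
}
```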
95cd2b9b1e Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob
Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows
schatz was assigned by claudio.atzori 2023-09-05 10:52:23 +02:00
claudio.atzori manually merged commit da0e9828f7 into master 2023-09-06 11:31:09 +02:00