1
0
Fork 0
Commit Graph

4563 Commits

Author SHA1 Message Date
Lampros Smyrnaios ce0aee21cc Improve performance of transferring the stats-DBs to another cluster and querying the DBs' tables, by ordering Spark to create up to 100 files per table, instead of thousands. 2024-07-03 20:15:33 +03:00
Lampros Smyrnaios 7b7dd32ad5 - Fix placement of some "set mapred.job.queue.name=analytics" statements and remove their unused "/*EOS*/" indicator.
- Add stacktrace-info to failed actions.
2024-07-03 19:53:24 +03:00
Lampros Smyrnaios 7ce051d766 - Update the remaining hive-actions to spark-actions.
- Update the version of shell-actions.
- Fix missing "/*EOS*/" indicators.
2024-07-03 19:49:19 +03:00
Lampros Smyrnaios aa4d7d5e20 Prioritize the rest of the stats-queries over other tasks on the cluster, by putting them in the "analytics" queue. 2024-07-03 19:14:25 +03:00
Claudio Atzori bb12d0b4df removed legacy actionmanager dependencies 2024-07-03 16:26:39 +02:00
Lampros Smyrnaios 54e11b6a43 Improve performance and efficiency by rewriting the creation process of "publication", "project", "dataset", "datasource", "software", "otherresearchproduct" and "result" tables, to be performed in a single query, for each one. 2024-07-03 13:03:15 +03:00
Miriam Baglioni a2b708bb71 [AffiliationIngestion]refactoring 2024-06-29 18:36:47 +02:00
Miriam Baglioni 9cbe966b4a [AffiliationIngestion]refactoring 2024-06-29 18:35:49 +02:00
Miriam Baglioni 236b64d830 [AffiliationIngestion]Extended the ingestion of affiliation from open aire to include also links derived from Web Crawl. Extended the test. Inserted in Constatns the id and name of the webcrawl datasource to be used here and also in the ingestion of links from web crawl 2024-06-29 18:29:20 +02:00
Miriam Baglioni 67ff783e65 [Person]First implementation to include Person entity in the graph 2024-06-29 17:13:01 +02:00
Michele De Bonis a10e8d9f05 implementation of countryMatch and addition of workflow parameters 2024-06-28 16:46:52 +02:00
Claudio Atzori 14539f9c8b [graph provision] publicFormat worfklow parameter defined as optional 2024-06-28 14:55:18 +02:00
Claudio Atzori 1bc8c5d173 [graph provision] fixed serialization of the instancetypes 2024-06-28 14:54:28 +02:00
Claudio Atzori 1ccf01cdb8 Using the updated Solr JSON payload model classes 2024-06-28 12:38:07 +02:00
Claudio Atzori b79cb155ba Merge pull request 'Fix permissions-issue in Stats-workflow, step22a-createPDFsAggregated.' (#450) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#450
2024-06-26 10:11:34 +02:00
Claudio Atzori 33a02c5b9e Merge pull request 'Change the selection criteria for the pivot record of a group so that by best pid type becomes the first criteria. This will have the effect to converge to records having DOI pid' (#446) from pivotselectionbypid into beta
Reviewed-on: D-Net/dnet-hadoop#446
2024-06-26 10:10:13 +02:00
Lampros Smyrnaios fe2275a9b0 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions
# Conflicts:
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step14.sql
2024-06-25 20:17:47 +03:00
Claudio Atzori 1c30eacac2 updated index feeding procedure to exploit the collection aliases 2024-06-25 15:27:38 +02:00
Claudio Atzori 6055212f77 merged from the json_payload branch 2024-06-25 12:39:02 +02:00
Claudio Atzori 0031cf849e Merge branch 'beta' into 9872-create-solr-collection-aliases 2024-06-25 09:58:01 +02:00
Serafeim Chatzopoulos 9f6e16a03c Add support to cretate/update solr collection aliases 2024-06-20 16:03:15 +03:00
Lampros Smyrnaios 66cd28f70a - Fix not using the "export HADOOP_USER_NAME" statement in "createPDFsAggregated.sh", which caused permission-issues when creating tables with Impala.
- Remove unused "--user" parameter in "impala-shell" calls.
- Code polishing.
2024-06-20 14:33:46 +03:00
Miriam Baglioni d35edac212 [IrishFunderList]make changed according to 9635 comment 20, 21, 22 and 23 2024-06-20 12:28:28 +02:00
Miriam Baglioni 6421f8fece Merge remote-tracking branch 'origin/beta' into beta 2024-06-19 11:12:15 +02:00
Miriam Baglioni ac270f795b [IrishFunderList]make changed according to 9635 comment 14, 15 and 16 2024-06-19 11:11:52 +02:00
Lampros Smyrnaios 285416c74e Merge branch 'beta' into beta 2024-06-18 13:50:38 +02:00
Lampros Smyrnaios 3095047e5e Miscellaneous updates to the copying operation to Impala Cluster:
- Fix not breaking out of the VIEWS-infinite-loop when the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" is set to "false".
- Exit the script when no HDFS-active-node was found, independently of the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR".
- Fix view_name-recognition in a log-message, by using the more advanced "Perl-Compatible Regular Expressions" in "grep".
- Add error-handling for "compute stats" errors.
2024-06-18 14:40:41 +03:00
Antonis Lempesis 0456f1b788 Merge remote-tracking branch 'origin/beta' into beta 2024-06-14 15:11:30 +03:00
Antonis Lempesis 38636942c7 filtering out deletedbyinference and invinsible results from accessroute 2024-06-14 15:11:19 +03:00
Lampros Smyrnaios d942a1101b Miscellaneous updates to the copying operation to Impala Cluster:
- Show some counts and the elapsed time for various sub-tasks.
- Code polishing.
2024-06-14 12:14:38 +03:00
Giambattista Bloisi 9bf2bda1c6 Fix: next returned a null value at end of stream 2024-06-12 13:28:51 +02:00
Giambattista Bloisi d90cb099b8 Fix for paginationStart parameter management 2024-06-11 20:23:44 +02:00
Giambattista Bloisi 4f2a61e10f Change the selection criteria for the pivot record of a group so that by best pid type becomes the first criteria. This will have the effect to slowly converge to records having DOI pid 2024-06-11 15:33:56 +02:00
Claudio Atzori 11fe3a4fe0 [graph resolution] use sparkExecutorMemory to define also the memoryOverhead 2024-06-11 14:21:17 +02:00
Miriam Baglioni 8fe934810f Merge remote-tracking branch 'origin/beta' into beta 2024-06-11 10:28:51 +02:00
Miriam Baglioni 9da006e98c [SDGFoSActionSet]remove datainfo for the result. It is not needed (qualifier.classid = UPDATE) useless since subject do not go at the level of the instance 2024-06-11 10:28:32 +02:00
Giambattista Bloisi 85c1eae7e0 Fixes for pagination strategy looping at end of download 2024-06-10 19:03:58 +02:00
Claudio Atzori b0eba210c0 [actionset promotion] use sparkExecutorMemory to define also the memoryOverhead 2024-06-10 16:15:24 +02:00
Claudio Atzori 3776327a8c hostedby patching to work with the updated Crossref contents, resolved conflict 2024-06-10 15:24:12 +02:00
Claudio Atzori 0139f23d66 Merge pull request 'organization type from OpenOrgs' (#445) from import_openorg_type into beta
Reviewed-on: D-Net/dnet-hadoop#445
2024-06-07 12:17:31 +02:00
Michele Artini c726572418 changed some parameters in OSF test 2024-06-07 12:03:26 +02:00
Claudio Atzori ec79405cc9 [graph raw] set organization type from openorgs 2024-06-07 11:30:31 +02:00
Miriam Baglioni 1477406ecc [bulkTag] fixed issue that made project disappear in graph_10_enriched 2024-06-06 10:45:41 +02:00
Claudio Atzori 92c3abd5a4 [graph cleaning] use sparkExecutorMemory to define also the memoryOverhead 2024-06-06 10:44:33 +02:00
Claudio Atzori ce2364743a applying changes from PR#442: Fix for missing collectedfrom after dedup 2024-06-06 10:43:43 +02:00
Claudio Atzori f70dc76b61 minor 2024-06-06 10:43:10 +02:00
Claudio Atzori 73bd1938a5 [graph2hive] use sparkExecutorMemory to define also the memoryOverhead 2024-06-05 12:17:35 +02:00
Claudio Atzori da5c1e73a4 Merge pull request 'Irish oaipmh exporter' (#443) from irish-oaipmh-exporter into beta
Reviewed-on: D-Net/dnet-hadoop#443
2024-06-05 10:55:09 +02:00
Claudio Atzori 81090ad593 [IE OAIPHM] added oozie workflow, minor changes, code formatting 2024-06-05 10:03:33 +02:00
Claudio Atzori a02f3f0d2b code formatting 2024-05-30 10:21:18 +02:00
Alessia Bardi 05ee783c07 Merge branch 'beta' into dblp_collection_plugin 2024-05-29 16:04:39 +02:00
Claudio Atzori c272c4ad68 code formatting 2024-05-29 15:50:07 +02:00
Alessia Bardi c5f4da16a4 Merge branch 'beta' into rest-collector-request-header-map 2024-05-29 15:46:23 +02:00
Alessia 1b165a14a0 Rest collector plugin on hadoop supports a new param to pass request headers 2024-05-29 15:41:36 +02:00
Michele Artini e996787be2 OSF test 2024-05-29 15:05:17 +02:00
Claudio Atzori 62716141c5 Merge pull request 'Miscellaneous updates to the copying operation to Impala Cluster' (#440) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#440
2024-05-29 14:34:51 +02:00
Miriam Baglioni 5d85b70e1f [NOAMI] removed Ireland funder id 501100011103. ticket 9635 2024-05-29 11:55:00 +02:00
Lampros Smyrnaios e3f28338c1 Miscellaneous updates to the copying operation to Impala Cluster:
- Assign the WRITE and EXECUTE permissions to the DBs' HDFS-directories, in order to be able to create tables on top of them, in the Impala Cluster.
- Make sure the "copydb" function returns early, when it encounters a fatal error, while respecting the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" config.
2024-05-28 17:51:45 +03:00
Miriam Baglioni 75d5ddb999 Update to include a blackList that filters out the results we know are wrongly associated to IE - update workflow definition - the blacklist parameter 2024-05-27 12:01:28 +02:00
Miriam Baglioni 87c9c61b41 Update to include a blackList that filters out the results we know are wrongly associated to IE - refactoring 2024-05-27 12:01:16 +02:00
Miriam Baglioni b55fed09f8 Update to include a blackList that filters out the results we know are wrongly associated to IE 2024-05-27 12:01:01 +02:00
Claudio Atzori 107d958b89 [org dedup] avoid NPEs in SparkPrepareNewOrgs 2024-05-27 11:59:54 +02:00
Claudio Atzori 3a7a6ecc32 [org dedup] avoid NPEs in SparkPrepareOrgRels 2024-05-27 11:59:45 +02:00
Claudio Atzori 1af4224d3d [org dedup] avoid NPEs in SparkPrepareOrgRels 2024-05-27 11:59:33 +02:00
Claudio Atzori 0d5bdb2db0 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-05-27 11:59:02 +02:00
Claudio Atzori 66548e6a83 Merge pull request 'changes in copy script' (#438) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#438
2024-05-27 11:54:03 +02:00
Lampros Smyrnaios 888637773c Add missing "/*EOS*/" comments. 2024-05-27 12:34:49 +03:00
Lampros Smyrnaios e0ac494859 Merge branch 'beta' into convert_hive_to_spark_actions
# Conflicts:
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step15.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step15_5.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_1-definitions.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_5.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step2.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step7.sql
2024-05-27 12:27:40 +03:00
Antonis Lempesis 15b54a345a added fos lvl4 2024-05-24 13:21:28 +03:00
Lampros Smyrnaios b48ed6e617 Change configuration in the copy-operation to Impala Cluster:
Set the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" parameter to "false".
2024-05-23 16:58:12 +03:00
Lampros Smyrnaios 68322843e2 Small updates to the copy-operation to Impala Cluster:
- Add a configuration-"switch" to control whether the script exits upon an error or not.
- Allow the script to exit when a table could not be created.
- Show the elapsed time for processing each database.
2024-05-23 15:07:49 +03:00
Lampros Smyrnaios c7b32bbacc Update CopyDataToImpalaCluster:
Update the code of acquiring the entities from Ocean cluster, through hive, in order to optimize the process and account for additional reserved keywords in Impala.

Co-authored-by: Antonis Lempesis <antleb@di.uoa.gr>
2024-05-23 13:00:19 +03:00
Sandro La Bruzzo f1fe363b19 merged again from beta (I hope for the last time) 2024-05-22 11:08:52 +02:00
Sandro La Bruzzo 66c1ffc866 merged again from beta (I hope for the last time) 2024-05-22 11:02:46 +02:00
Claudio Atzori 1ea67eba82 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-05-21 13:48:48 +02:00
Claudio Atzori 834461ba26 [graph provision]fixed wf definition, revised serialization of the usage counts measures 2024-05-21 13:48:06 +02:00
Sandro La Bruzzo e8a61d5dd5 removed plugin, use only FileGZip plugin 2024-05-21 13:45:29 +02:00
Sandro La Bruzzo ca9414b737 Implement multiple node name splitter on GZipCollectorPlugin and all nodes that use XMLIterator. If the splitter name contains is a comma separated values it splits for all the values 2024-05-21 09:11:13 +02:00
Sandro La Bruzzo 032bcc8279 since last beta workflow we decide to introduce in the graph only MAG item with DOI and set them invisible ( this should be the same behaviour of the previous DOIBoost mapping).
This commit apply this type of mapping
2024-05-20 09:24:15 +02:00
Sandro La Bruzzo 103e2652b3 merged beta 2024-05-17 14:43:07 +02:00
Sandro La Bruzzo a87f9ea643 fixed scholexplorer bug 2024-05-17 14:16:43 +02:00
Sandro La Bruzzo 6efab4d88e fixed scholexplorer bug 2024-05-16 16:19:18 +02:00
Claudio Atzori 92f018d196 [graph provision] fixed path pointing to an intermediate data store in the working directory 2024-05-15 15:39:18 +02:00
Claudio Atzori 0611c81a2f [graph provision] using Qualifier.classNames to populate the correponsing fields in the JSON payload 2024-05-15 15:33:10 +02:00
Michele Artini 2b3b5fe9a1 oai finalization and test 2024-05-15 14:13:16 +02:00
Claudio Atzori 1efe7f7e39 [graph provision] upgrade to dhp-schema:6.1.2, included project.oamandatepublications in the JSON payload mapping, fixed serialisation of the usageCounts measures 2024-05-14 12:39:31 +02:00
Claudio Atzori f7d56e2ef2 Merge branch 'beta' into rest-collector-plugin-with-retry 2024-05-10 09:02:21 +02:00
Claudio Atzori dc3a5858f7 Merge branch 'beta' into beta_provision_relation 2024-05-09 14:14:43 +02:00
Claudio Atzori 55f39f7850 [graph provision] adds the possibility to validate the XML records before storing them via the validateXML parameter 2024-05-09 14:06:04 +02:00
Claudio Atzori 39a2afe8b5 [graph provision] fixed XML serialization of the usage counts measures, renamed workflow actions to better reflect their role 2024-05-09 13:54:42 +02:00
Claudio Atzori 908ed9da7a Merge pull request 'Various fixes in the stats wf' (#430) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#430
2024-05-08 13:41:02 +02:00
Antonis Lempesis 0cada3cc8f every step is run in the analytics queue. Hardcoded for now, will make a parameter later 2024-05-08 13:42:53 +03:00
Antonis Lempesis 90a4fb3547 fixed typos 2024-05-08 13:17:58 +03:00
Claudio Atzori 18aa323ee9 cleanup unused classes, adjustments in the oozie wf definition 2024-05-08 11:36:46 +02:00
Michele Artini c9a327bc50 refactoring of gzip method 2024-05-08 11:34:08 +02:00
Michele Artini e234848af8 oaf record: xpath for root 2024-05-08 10:00:53 +02:00
Claudio Atzori b4e3389432 fixed property mapping creating the RelatedEntity transient objects. spark cores & memory adjustments. Code formatting 2024-05-07 16:25:17 +02:00
Giambattista Bloisi 711048ceed PrepareRelationsJob rewritten to use Spark Dataframe API and Windowing functions 2024-05-07 15:44:33 +02:00
Michele Artini 70bf6ac415 oai exporter tests 2024-05-07 09:36:26 +02:00
Michele Artini aa40e53c19 oai exporter parameters 2024-05-07 08:01:19 +02:00
Michele Artini ed052a3476 job for the population of the oai database 2024-05-06 16:08:33 +02:00
Claudio Atzori 26363060ed fixed id prefix creation for the fosnodoi records, again 2024-05-03 15:53:52 +02:00
Claudio Atzori 0486227185 [cleaning] deactivating the cleaning of FOS subjects found in the metadata provided by repositories 2024-05-03 14:31:12 +02:00
Claudio Atzori e1a0fb8933 fixed id prefix creation for the fosnodoi records 2024-05-03 14:14:18 +02:00
Sandro La Bruzzo 26bf8e763a merged from beta 2024-05-02 15:20:23 +02:00
Sandro La Bruzzo 0646d0d064 Updated main sparkApplication to avoid to require master variable 2024-05-02 15:15:03 +02:00
Claudio Atzori 4355f64810 reverted to version 1.2.5-SNAPSHOT 2024-05-02 11:23:53 +02:00
Claudio Atzori 66680b8b9a refactoring of common utilities 2024-05-02 11:16:58 +02:00
Claudio Atzori dcf23b3d06 Merge branch 'beta' into beta-release-1.2.5 2024-05-02 10:01:49 +02:00
Michele Artini f4068de298 code reindent + tests 2024-05-02 09:51:33 +02:00
Claudio Atzori 11bd89e132 [enrichment] use sparkExecutorMemory to define also the memoryOverhead 2024-05-01 08:32:59 +02:00
Claudio Atzori e96c2c1606 [ranking wf] set spark.executor.memoryOverhead to fine tune the resource consumption 2024-04-30 16:23:25 +02:00
Claudio Atzori 50c18f7a0b [dedup wf] revised memory settings to address the increased volume of input contents 2024-04-30 12:34:16 +02:00
Michele Artini 2615136efc added a retry mechanism 2024-04-30 11:58:42 +02:00
Sandro La Bruzzo 133ead1e3e updated new version of scholexplorer Generation 2024-04-29 09:00:30 +02:00
Sandro La Bruzzo 052c6aac9d formatted code 2024-04-26 16:03:04 +02:00
Sandro La Bruzzo 9cd3bc0f10 Added a new generation of the dump for scholexplorer tested with last version of spark, and strongly refactored 2024-04-26 16:02:07 +02:00
Claudio Atzori e2937db385 Merge branch 'beta' into misc_fixes_merge_entities 2024-04-24 08:55:28 +02:00
Giambattista Bloisi 1878199dae Miscellaneous fixes:
- in Merge By ID pick by preference those records coming from delegated Authorities
- fix various tests
- close spark session in SparkCreateSimRels
2024-04-24 08:12:45 +02:00
Sandro La Bruzzo 0d628cd62b merged again from beta 2024-04-23 17:34:55 +02:00
Lampros Smyrnaios 3c17183d10 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions 2024-04-23 17:18:16 +03:00
Lampros Smyrnaios 49af2e5740 Miscellaneous updates to the copying operation to Impala Cluster:
- Update the algorithm for creating views that depend on other views; overcome some bash-instabilities.
- Upon any error, fail the whole process, not just the current DB-creation, as those errors usually indicate a bug in the initial DB-creation, that should be fixed immediately.
- Enhance parallel-copy of large files by "hadoop distcp" command.
- Reduce the "invalidate metadata" commands to just the current DB's tables, in order to eliminate the general overhead on Impala.
- Show the number of tables and views in the logs.
- Fix some log-messages.
2024-04-23 17:15:04 +03:00
Antonis Lempesis d2649a1429 increased the jvm ram 2024-04-23 16:03:16 +03:00
Claudio Atzori c3053ef34d using version 1.2.5-beta for the release 2024-04-23 14:52:32 +02:00
Claudio Atzori b5bcab13ec using version 1.2.5-beta for the release 2024-04-23 14:36:39 +02:00
Claudio Atzori 425c9afc36 using version 1.2.5-beta for the release 2024-04-23 14:30:04 +02:00
Claudio Atzori 93dd9cc639 code formatting 2024-04-23 11:28:00 +02:00
Miriam Baglioni 6189879643 [NOAMI] removed entry for Irish Research eLibray (IReL) Care Board from the list of funders. 2024-04-23 11:09:18 +02:00
Lampros Smyrnaios 69a9ac7393 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions 2024-04-22 17:07:11 +03:00
Miriam Baglioni 7de114bda0 [WebCrawl] addressing comments from PR 2024-04-22 13:52:50 +02:00
Sandro La Bruzzo 073f320c6a Added module containing all the dependencies, useful for spark deploy on k8. 2024-04-22 11:32:31 +02:00
Miriam Baglioni 776c898c4b [WebCrawl] adding affiliation relations from web information 2024-04-22 11:04:17 +02:00
Claudio Atzori 0656ab2838 code formatting 2024-04-20 08:10:58 +02:00
Claudio Atzori ab7f0855af fixed query reading projects from the aggregator DB 2024-04-20 08:10:32 +02:00
Claudio Atzori e5879b68c7 [transformative agreement] including reuslt-funder relations to the information imported from the TRs 2024-04-19 17:14:18 +02:00
Claudio Atzori 3a027e97a7 [graph indexing] sets spark memoryOverhead in the join operations to the same value used for the memory executor 2024-04-19 16:59:58 +02:00
Sandro La Bruzzo b72c3139e2 updated Ignore annotation that is deprecated to Disabled 2024-04-19 14:52:40 +02:00
Sandro La Bruzzo b84ad0c06e merged beta 2024-04-19 14:39:59 +02:00
Antonis Lempesis b52a5a753b Merge remote-tracking branch 'upstream/beta' into beta 2024-04-19 15:28:28 +03:00
Sandro La Bruzzo 8dd9cf84e2 code formatted 2024-04-19 12:30:59 +02:00
Lampros Smyrnaios 342223f75c Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions 2024-04-19 13:18:34 +03:00
Sandro La Bruzzo 342cb6189b fixed problem on changed signature on RowEncoder
removed property dhp.schema.artifact
2024-04-19 12:13:26 +02:00
Antonis Lempesis c3fe9662b2 all indicator tables are now stored as parquet 2024-04-19 12:45:36 +03:00
Lampros Smyrnaios 2616971e2b dhp-stats-update: remove leftover duplicate line 2024-04-18 16:18:16 +03:00
Lampros Smyrnaios ba533d9f34 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions 2024-04-18 15:47:56 +03:00
Lampros Smyrnaios d46b78b659 dhp-stats-update:
- Set Steps 2-7 and 9 to limit the amount of files generated by Spark, from 8000, down to 100, to improve file-transfer and querying performance.
- Allow the workflow to run up to Step10. The Step11 seems to have some issues even when using hive-action.
2024-04-18 15:40:27 +03:00
Lampros Smyrnaios 6f2ebb2a52 Revert Step8 and Step11 to use Hive again, since their "UPDATE" statements are not supported by Spark. 2024-04-18 15:35:03 +03:00
Claudio Atzori 57c678d904 integrating changes from PR#424 2024-04-18 11:38:35 +02:00
Claudio Atzori 5ab8cd1794 Various fixes for the stats DB update workflow, step16-createIndicatorsTables.sql 2024-04-18 11:28:18 +02:00
Antonis Lempesis 0c71c58df6 fixed the definition of gold_oa 2024-04-18 12:01:27 +03:00
Antonis Lempesis 43d05dbebb fixed the definition of result_country 2024-04-18 11:53:50 +03:00
Antonis Lempesis e728a0897c fixed the definition of indi_pub_bronze_oa 2024-04-18 11:07:55 +03:00
Antonis Lempesis 308ae580a9 slight optimization in indi_pub_gold_oa definition 2024-04-18 10:57:52 +03:00
Antonis Lempesis 27d22bd8f9 slight optimization in indi_pub_gold_oa definition 2024-04-17 23:59:52 +03:00
Antonis Lempesis 1f5aba12fa slight optimization in indi_pub_gold_oa definition 2024-04-17 23:54:23 +03:00
Lampros Smyrnaios ca091c0f1e dhp-stats-update:
- Fix not passing some parameters to some Spark actions.
- Allow the workflow to run up to Step7. The first 7 steps seem to work out of the box.
2024-04-17 14:03:59 +03:00
Claudio Atzori ac8747582c Merge branch 'beta' into doidoost_dismiss 2024-04-17 12:01:01 +02:00
Giambattista Bloisi 8ac167e420 Refinements to PR #404: refactoring the Oaf records merge utilities into dhp-common 2024-04-16 17:18:28 +02:00
Lampros Smyrnaios 0b897f2f66 Fix and add missing "DROP TABLE" statements, in "dhp-stats-update" sql-scripts. 2024-04-16 18:17:54 +03:00
Miriam Baglioni 0625b9061f removed the funder id : 100011062 Asian Spinal Cord Network, wrongly associated to Ireland 2024-04-16 15:26:53 +02:00
Miriam Baglioni 9eeb9f5d32 mergin with branch beta 2024-04-16 15:24:40 +02:00
Claudio Atzori 589bce3520 Merge pull request '[pBETA] Improvements to copying data from ocean to impala' (#421) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#421
2024-04-16 14:22:32 +02:00
Sandro La Bruzzo a5ddd8dfbb Added Action set generation for the MAG organization 2024-04-16 13:39:15 +02:00
Giambattista Bloisi da333e9f4d Merge pull request 'Enhance Dedup authors matching with algorithms used for ORCID enhancements (task 9690)' (#419) from dedup_authorsmatch_bytoken into beta
Reviewed-on: D-Net/dnet-hadoop#419
2024-04-16 10:24:11 +02:00
Michele Artini 78b9d84e4a test 2024-04-16 09:41:16 +02:00
Giambattista Bloisi 43b454399f - Bug fix in matchOrderedTokenAndAbbreviations algorithms where tokens with same initial character were always considered equal
- AuthorsMatch exploits the new matching strategy used for ORCID enhancements in #PR398: split author names in tokens, order the tokens, then check for matches of ordered full tokens or abbreviations
2024-04-15 18:19:29 +02:00
Lampros Smyrnaios db33f7727c Update "dhp-stats-update" workflow to use "spark"-actions, instead of "hive" ones.
Note: Currently the code is set to only test the "Step1".
2024-04-15 16:22:40 +03:00
Lampros Smyrnaios d7da4f814b Minor updates to the copying operation to Impala Cluster:
- Improve logging.
- Code optimization/polishing.
2024-04-12 18:12:06 +03:00
Lampros Smyrnaios 14719dcd62 Miscellaneous updates to the copying operation to Impala Cluster:
- Update the algorithm for creating views that depend on other views.
- Add check for successful execution of the "hadoop distcp" command.
- Add a check for successful copy operation of all entities.
- Upon facing an error in a DB, exit the method, instead of the whole script.
- Improve logging.
- Code polishing.
2024-04-12 15:36:13 +03:00
Sandro La Bruzzo 41a42dde64 code formatted 2024-04-11 17:43:48 +02:00
Sandro La Bruzzo 843dc95340 resolved conflict 2024-04-11 17:38:16 +02:00
Sandro La Bruzzo 1e30454ee0 added vocabulary tu instanceTypeMApping of Mag 2024-04-11 17:32:30 +02:00
Sandro La Bruzzo 2581672c11 updated wf of MAG and crossref to use transaction 2024-04-11 17:27:49 +02:00
Lampros Smyrnaios 22745027c8 Use the "HADOOP_USER_NAME" value from the "workflow-property", in "copyDataToImpalaCluster.sh", in "stats-monitor-updates". 2024-04-11 17:46:33 +03:00
Lampros Smyrnaios abf0b69f29 Upgrade the copying operation to Impala Cluster:
- Use only hive commands in the Ocean Cluster, as the "impala-shell" will be removed from there to free-up resources.
- Hugely improve the performance in every aspect of the copying process: a) speedup file-transferring and DB-deletion, b) eliminate permissions-assignment, "load" operations and "use $db" queries, c) retry only the "create view" statements and only as long as they depend on other non-created views, instead of trying to recreate all tables and views 5 consecutive times.
- Add error-checks for the creation of tables and views.
2024-04-11 17:12:12 +03:00
Sandro La Bruzzo a0642bd190 added instanceTypeMapping field on MAG 2024-04-11 13:10:12 +02:00
Sandro La Bruzzo 98dc042db5 mapping generated for MAG,
missing generation of Organization Action set
2024-04-05 18:12:53 +02:00
Sandro La Bruzzo ef582948a7 Updated mapping 2024-04-05 11:10:44 +02:00
Sandro La Bruzzo 5142f462b5 completed mapping from paper to OAF, not tested 2024-04-04 21:06:04 +02:00
Miriam Baglioni 0794e0667b Merge branch 'doidoost_dismiss' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doidoost_dismiss 2024-04-04 09:16:18 +02:00
Miriam Baglioni 4b1de076ac [DataciteHostedByMap] added entry for EBRAINS 2024-04-04 09:16:14 +02:00
Miriam Baglioni c8a88b2187 [DataciteHostedByMap] added entry for EBRAINS 2024-04-04 09:14:58 +02:00
Sandro La Bruzzo 31e152d2bb Merge remote-tracking branch 'origin/doidoost_dismiss' into doidoost_dismiss 2024-04-03 17:08:35 +02:00
Sandro La Bruzzo 6f3e925cae Implemented first part of the new MAG mapping 2024-04-03 17:07:14 +02:00
Miriam Baglioni f0f6abf892 [MapToFunderLink]added references for HFRI and Erasmus+ for the creation of links for funders 2024-04-03 14:59:09 +02:00
Claudio Atzori 26b97aa5ed Merge pull request '[BETA] fixed the result_country definition and updated the stats DB copy procedure' (#416) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#416
2024-04-03 12:36:03 +02:00
Lampros Smyrnaios b7c8acc563 - Update the code which acquires the "IMPALA_HDFS_NODE", to test the "tmp"-dir, instead of the base-dir and introduce retries, to overcome potential file-system failures. This change was suggested by "Sebastian Tymkow" and "Grzegorz Bakalarski".
- Fix typos.
2024-04-03 13:15:37 +03:00
Miriam Baglioni 50fbebf186 [NOAMI] removed entry for Health and Social Care Board from the list of funders. Modified IRC putting 1596 and 1597 as synonyms, as required in ticket 9635 2024-04-03 11:45:40 +02:00
Michele Artini 71d6e02886 Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2024-04-03 09:50:41 +02:00
Michele Artini 02c9a311c8 base datainfo with trust=0.89 2024-04-03 09:50:21 +02:00
Miriam Baglioni 42846d3b91 [OpenCitation] add compression option when writing the sequence file 2024-04-03 09:25:00 +02:00
Miriam Baglioni 4f0a044245 Merge pull request 'Add action set creation for Datacite affiliations' (#413) from 9647_datacite_affiliations into beta
Reviewed-on: D-Net/dnet-hadoop#413
2024-04-02 17:33:38 +02:00
Serafeim Chatzopoulos cbe13a5c61 Fix datacite input path in properties file 2024-04-02 18:00:35 +03:00
Miriam Baglioni 9c9a9562ae [UsageCount] fixed error 2024-04-02 16:56:37 +02:00
Miriam Baglioni b42bdd5fb3 [UsageCount] add check in case the datasource is not matched against those present in the graph 2024-04-02 16:28:27 +02:00
Miriam Baglioni 64cbd8abe9 Merge pull request '[UsageCount] Usage count per result split by datasource' (#318) from UsageStatsRecordDS into beta
Reviewed-on: D-Net/dnet-hadoop#318
2024-04-02 10:21:39 +02:00
Antonis Lempesis df6e3bda04 added new orgs in monitor 2024-04-01 22:45:29 +03:00
Antonis Lempesis 573b081f1d added new orgs in monitor 2024-04-01 22:24:46 +03:00
Serafeim Chatzopoulos 0eb0701b26 Add action set creation for Datacite affiliations 2024-04-01 17:23:26 +03:00
Antonis Lempesis 0bf2a7a359 fixed the result_country definition 2024-04-01 15:23:22 +03:00