Commit Graph

4076 Commits

Author SHA1 Message Date
Miriam Baglioni c334fe2438 Merge pull request 'Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities' (#328) from cleanup_relations_after_dedup into beta
Reviewed-on: D-Net/dnet-hadoop#328
2023-08-08 09:49:12 +02:00
Miriam Baglioni 0e2f855807 Merge pull request 'Updates Promotion DBs' (#321) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#321
2023-08-07 12:09:16 +02:00
Miriam Baglioni 18fbe52b20 Merge pull request 'Import affiliation relations from Crossref' (#320) from 8876 into beta
Reviewed-on: D-Net/dnet-hadoop#320
2023-08-07 10:45:30 +02:00
Giambattista Bloisi 97b6d1dc45 Filter ids by dataInfo.deletedbyinference and DataInfo.invisible flags
Filter relations also by dataInfo.invisible flag
2023-08-07 10:24:11 +02:00
Giambattista Bloisi af49424b59 Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleyted by inference or that are pointing to dangling entities 2023-08-04 14:27:39 +02:00
Claudio Atzori 0bc74e2000 code formatting 2023-08-02 11:52:10 +02:00
Claudio Atzori 11ffb9bd68 rule out records with NULL dataInfo 2023-07-31 12:35:33 +02:00
Claudio Atzori ccac6a7f75 rule out records with NULL dataInfo 2023-07-31 12:35:05 +02:00
Serafeim Chatzopoulos 7cefe2665b Remove unnecessary classes 2023-07-28 19:14:39 +03:00
Serafeim Chatzopoulos 26a92ce762 Merge branch '8876' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8876 2023-07-28 19:03:57 +03:00
Serafeim Chatzopoulos ebfba38ab6 Add changes from code review 2023-07-28 19:03:47 +03:00
Serafeim Chatzopoulos eb8684a8cf Merge branch 'beta' into 8876 2023-07-28 13:39:33 +02:00
Claudio Atzori a72b9e96ac expand the instance level fulltext in the XML records 2023-07-27 14:57:38 +02:00
Claudio Atzori 59764145bb cherry picked & fixed commit 270df939c4 2023-07-25 17:39:00 +02:00
Claudio Atzori 270df939c4 partial implementation of the suggestions from https://support.openaire.eu/issues/8898 2023-07-25 17:29:50 +02:00
Giambattista Bloisi e64c2854a3 Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Giambattista Bloisi bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
Serafeim Chatzopoulos 3a0f09774a Add script to find score limits 2023-07-21 17:55:41 +03:00
Ilias Kanellos 06b9b71c4e Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 17:42:49 +03:00
Ilias Kanellos 2374f445a9 Produce additional bip update specific files 2023-07-21 17:42:46 +03:00
Serafeim Chatzopoulos cb0f3c50f6 Format workflow.xml 2023-07-21 16:07:10 +03:00
Serafeim Chatzopoulos c64e5e588f Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 15:27:02 +03:00
Serafeim Chatzopoulos 2cc5b1a39b Fixes in workflow.xml 2023-07-21 15:26:50 +03:00
Ilias Kanellos 0f96af5d56 Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 13:42:35 +03:00
Ilias Kanellos 03da965162 Format bip-score based file without doi references 2023-07-21 13:42:30 +03:00
Giambattista Bloisi f03153823a Update testCitationRelations number of expected citations according to changes made in 0559d8b4 (monodirectional citations) 2023-07-21 10:48:28 +02:00
Giambattista Bloisi 54c1eacef1 SparkJobTest was failing because testing workingdir was not cleaned up after eact test 2023-07-21 10:42:24 +02:00
Giambattista Bloisi 5e15f20e6e Fix entityMerger that was excluding the authors of the first entity in the list to merge 2023-07-21 00:46:54 +02:00
Giambattista Bloisi 0210a14e43 Ignore timestamp differences in PromoteActionPayloadForGraphTableJobTest 2023-07-20 23:45:57 +02:00
Giambattista Bloisi dba34505de Fix SparkStatsTest bug where parquet tables were incorrectly read as text files leading to unpredictable count() values 2023-07-19 14:24:52 +02:00
Giambattista Bloisi e47ed1fdb2 Use DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES in json mapper to avoid that tests fail if they encounter unmapped properties 2023-07-19 14:21:40 +02:00
Serafeim Chatzopoulos db4ca43ee8 Resolve conflict 2023-07-18 18:38:26 +03:00
Serafeim Chatzopoulos be320ba3c1 Indentation fixes 2023-07-17 16:04:21 +03:00
dimitrispie be4856ef35 Update step15.sql 2023-07-17 15:33:58 +03:00
Serafeim Chatzopoulos bc1a4611aa Minor changes 2023-07-17 11:17:53 +03:00
Claudio Atzori 8af129b0c7 merged stats promotion step from antonis/promotion-prod-only 2023-07-13 15:03:28 +02:00
dimitrispie 706092bc19 Update updateProductionViews.sh 2023-07-13 15:48:12 +03:00
dimitrispie aedd279f78 Updates Promotion DBs
- Add a step for promoting the splitted monitor DBs
2023-07-13 15:35:46 +03:00
dimitrispie 163b2ee2a8 Changes
1. Monitor updates
2. Bug fixes during copy to impala cluster
2023-07-13 15:25:00 +03:00
dimitrispie 76901a25f9 Updates Promotion DBs
- Add a step for promoting the splitted monitor DBs
2023-07-12 22:49:08 +03:00
Serafeim Chatzopoulos 4eba14a80e Add oozie workflow 2023-07-06 21:07:50 +03:00
Serafeim Chatzopoulos c2998a14e8 Add basic tests for affiliation relations 2023-07-06 20:28:16 +03:00
Serafeim Chatzopoulos bc7b00bcd1 Add bi-directional affiliation relations 2023-07-06 18:29:15 +03:00
Serafeim Chatzopoulos 12528ed2ef Refactor PrepareAffiliationRelations.java to use OafMapperUtils common functions 2023-07-06 18:08:33 +03:00
Serafeim Chatzopoulos bbc245696e Prepare actionsets for BIP affiliations 2023-07-06 15:56:12 +03:00
Ilias Kanellos 0c433eccdd Fix scores & Workflow 2023-07-06 15:06:28 +03:00
Ilias Kanellos d5c39a1059 Fix map scores to doi 2023-07-06 15:04:48 +03:00
Ilias Kanellos 772d5f0aab Make PR and AttRank serial 2023-07-06 13:47:51 +03:00
Giambattista Bloisi 801da2fd4a New sources formatted by maven plugin 2023-07-06 10:28:53 +02:00
Giambattista Bloisi bd3fcf869a rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules 2023-07-06 10:02:23 +02:00
Serafeim Chatzopoulos 347a889b20 Read affiliation relations 2023-07-06 00:51:01 +03:00
Miriam Baglioni 8dcd028eed [UsageCount] fixed typo in attribute name for datasource table 2023-07-01 16:07:22 +02:00
Miriam Baglioni 7738372125 [UsageCount] fixed typo in attribute name for datasource table 2023-06-30 18:56:41 +02:00
Sandro La Bruzzo 9963fd6d29 updated log to add subentity 2023-06-28 13:36:05 +02:00
Claudio Atzori f3a85e224b merged from branch beta the bulk tagging (single step, negative constraints), the cleanig worflow (single step, pid type based cleaning), instance level fulltext 2023-06-28 13:33:57 +02:00
Sandro La Bruzzo ed7e2ab6d1 reverted mistake on commit workflow.xml 2023-06-28 11:40:19 +02:00
Sandro La Bruzzo 9910ce06ae added to CreateSimRel the feature to write time log 2023-06-28 11:38:16 +02:00
Miriam Baglioni 2717edafb7 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2023-06-28 11:25:14 +02:00
Miriam Baglioni 2f04c9d149 [BulkTagging] fixing left over for test 2023-06-28 11:24:42 +02:00
Sandro La Bruzzo bd17c3edc8 added to CreateSimRel the feature to write time log 2023-06-28 11:20:58 +02:00
Claudio Atzori 288ec0b7d6 [doiboost] merged workflow from branch beta 2023-06-28 09:15:37 +02:00
Claudio Atzori e10ce92fe5 [stats wf] merged workflows from branch beta 2023-06-27 14:32:48 +02:00
Claudio Atzori d029bf0b94 Merge branch 'master' into distinct_pids_from_openorgs 2023-06-27 12:24:35 +02:00
Serafeim Chatzopoulos 60f25b780d Minor fixes in workflow.xml and job.properties 2023-06-23 12:51:50 +03:00
Michele Artini 88a1cbc37d fixed a datasource id 2023-06-22 07:56:33 +02:00
Michele Artini 009d7f312f fixed a datasource Id 2023-06-21 16:17:34 +02:00
Claudio Atzori b0ebf56367 Merge pull request 'Update step15_5.sql' (#314) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#314
2023-06-21 10:33:22 +02:00
dimitrispie 2b6370eaee Update step15_5.sql
Bug fix
2023-06-21 11:31:10 +03:00
Claudio Atzori 35e42a86ed Merge pull request 'Update step15_5.sql' (#313) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#313
2023-06-21 10:26:16 +02:00
dimitrispie 74cb060bfe Update step15_5.sql
Add "if not exists" clause
2023-06-21 11:24:06 +03:00
Claudio Atzori 85e016df17 Merge pull request 'Update step16-createIndicatorsTables.sql' (#312) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#312
2023-06-21 09:52:33 +02:00
dimitrispie a475cfcb7b Update step16-createIndicatorsTables.sql
Rename a field in indi_pub_interdisciplinarity
2023-06-21 10:42:02 +03:00
Claudio Atzori 979cf9cd87 Merge pull request 'Update step15.sql' (#311) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#311
2023-06-21 09:20:01 +02:00
dimitrispie 4648cd88d4 Update step15.sql
Cast score to double
2023-06-21 10:02:19 +03:00
dimitrispie 94d2573c77 Update step15.sql
Bug Fix
2023-06-21 09:22:39 +03:00
Claudio Atzori 0561362de2 Merge pull request 'Update step20-createMonitorDB_institutions.sql' (#309) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#309
2023-06-20 15:07:09 +02:00
Claudio Atzori 50d7dc0078 [graph enrichment] fixed projectOrganizationPath not being passed to the apply_resulttoorganization_propagation node 2023-06-19 15:42:44 +02:00
Claudio Atzori fbd9bf704e indent 2023-06-19 15:41:22 +02:00
Giambattista Bloisi 758e662ab8 Revert "REmove duplicated code and ensure that load and initialization is done through "DedupConfig.load" method"
This reverts commit 485f9d18cb.
2023-06-19 13:08:10 +02:00
Giambattista Bloisi 485f9d18cb REmove duplicated code and ensure that load and initialization is done through "DedupConfig.load" method 2023-06-19 13:00:02 +02:00
dimitrispie be2caedb04 Update step20-createMonitorDB_institutions.sql
Add openorgs____::1624ff7c01bb641b91f4518539a0c28a Vrije Universiteit Amsterdam
2023-06-19 12:12:17 +03:00
dimitrispie 36e0a8fec4 Changes to Promotion Stats WF
1. Add new cluster host at impala-shell commands
2. Add a step for splitting monitor dbs
3. Update workflow.xml to included the new splitting monitor dbs step
2023-06-19 09:44:34 +03:00
dimitrispie 4c770a5e29 Update finalizeImpalaCluster.sh
Drop views in shadow dbs before dropping the db
2023-06-15 13:25:37 +03:00
dimitrispie e06d962a6a Update step15.sql 2023-06-15 12:20:35 +03:00
dimitrispie afcad08396 Update step20-createMonitorDB_institutions.sql
Added openorgs____::c0b262bd6eab819e4c994914f9c010e2   -- National Institute of Geophysics and Volcanology
2023-06-15 10:28:49 +03:00
Claudio Atzori b9748763e2 Merge pull request '[stats wf] Bug fixes' (#308) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#308
2023-06-14 21:57:03 +02:00
dimitrispie 42b8ce2ba4 Update copyDataToImpalaCluster.sh 2023-06-14 19:23:42 +03:00
dimitrispie 2032b0df40 Bug fixes
1. Remove tables/views from old databases in the new cluster, before dropping the dbs
2. Fix id in result_accessroute, indi_impact_measures, indi_pub_bronze_oa
2023-06-14 19:09:09 +03:00
Michele Artini a92206dab5 re-added the name of a column (pid) 2023-06-13 11:43:10 +02:00
Claudio Atzori b76a47b103 [aggregator graph] added column alias when mapping organization PIDs from the OpenOrgs database 2023-06-13 11:38:10 +02:00
Claudio Atzori ad04f14b81 Merge branch 'beta' into distinct_pids_from_openorgs_beta 2023-06-12 09:58:21 +02:00
Claudio Atzori 55f002f1e9 Merge branch 'beta' into propagationProjectThroughParentChils 2023-06-12 09:56:53 +02:00
Claudio Atzori 4b00a76271 Merge branch 'beta' into fulltext_url_validation 2023-06-12 09:55:25 +02:00
Claudio Atzori de225c71cd Merge branch 'beta' into removeTaggingCondition 2023-06-12 09:50:40 +02:00
Claudio Atzori e1409ffe80 update sql query to return distinct pids 2023-06-12 09:47:45 +02:00
Claudio Atzori da7b66c542 Merge pull request '[stats wf] Added memory to hive' (#305) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#305
2023-06-08 08:58:48 +02:00
dimitrispie c5f42c7f5b Added memory to hive 2023-06-07 18:18:23 +03:00
Claudio Atzori afb76ebf0f Merge pull request '[stats wf] Bug fix on indicators step' (#304) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#304
2023-06-07 16:49:09 +02:00
dimitrispie fa24e2e18f Bug fix on indicators step
indi_pub_gold_oa table was missing during the creation of other indicators
2023-06-07 17:43:37 +03:00
Claudio Atzori 01c67e697d Merge pull request '[ stats wf] Bug fix' (#303) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#303
2023-06-07 14:41:44 +02:00
dimitrispie 28272c1b0e Bug fix 2023-06-07 15:34:01 +03:00
Alessia Bardi d5be6a13e9 Updated officialnmae of pangaea in hostedbymap for Datacite to avoid duplicate entries in the source filter of the portal 2023-06-06 14:43:32 +02:00
Alessia Bardi 118e72d7db Updated officialnmae of pangaea in hostedbymap for Datacite to avoid duplicate entries in the source filter of the portal 2023-06-06 14:39:12 +02:00
Alessia Bardi 5befd93d7d test records for Solr indexing 2023-06-06 14:34:33 +02:00
Michele Artini cae92cf811 update sql query to return distinct pids 2023-06-06 14:06:06 +02:00
Claudio Atzori 8f651f1225 Merge pull request 'Changes to beta stats wf' (#300) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#300
2023-06-06 11:41:36 +02:00
dimitrispie ad07fbf053 Add names to organizations for collaboration indicators 2023-06-02 14:13:10 +03:00
dimitrispie 2324670714 Split Monitor DBs-Interdisciplinarity indicators
- Split DBs Monitor for faster rendering of visualizations
- Add interdisciplinarity indicators from result_fos
2023-06-02 13:34:16 +03:00
Miriam Baglioni daf4d7971b refactoring 2023-05-31 18:56:58 +02:00
Miriam Baglioni 97d72d41c3 finalization of implementation and testing 2023-05-31 18:53:22 +02:00
Miriam Baglioni 0389b57ca7 added propagation for project to organization 2023-05-31 11:06:58 +02:00
Claudio Atzori e45777e7e1 [aggregator graph] added validation for URLs mapped from oaf:fulltext 2023-05-26 11:33:42 +02:00
dimitrispie ebe586b1d1 Impact indicators/Unpaywall
- Added Impact indicators
- Added unpaywall open access colours
2023-05-26 10:25:28 +03:00
dimitrispie d6102dd576 Update step16-createIndicatorsTables.sql
- Add org names to indi_project_collab_org
- Add indi_pub_bronze_oa
 - Changes to indi_pub_hybrid_oa_with_cc
2023-05-25 14:52:34 +03:00
Miriam Baglioni 9097e71853 Added assertion in test 2023-05-24 16:30:53 +02:00
Miriam Baglioni 9567c13bc3 refactoring 2023-05-24 16:20:05 +02:00
Miriam Baglioni 34172455d1 [BulkTag] Adding remove constraints to specify when a community must not appear in the context of a result. 2023-05-24 09:56:23 +02:00
Ilias Kanellos a1b9187039 Fix syntax error on workflow.xml 2023-05-23 17:17:12 +03:00
Ilias Kanellos 6a7e370a21 Remove unnecessary counts in graph creation 2023-05-23 16:48:58 +03:00
Ilias Kanellos ec4e010687 End after rankings | Create graph debugged 2023-05-23 16:44:04 +03:00
Claudio Atzori db625e548d [UsageCount] addition of usagecount for Projects and datasources 2023-05-22 15:00:46 +02:00
Alessia Bardi 04141fe259 tests for records from D4Science catalogues 2023-05-19 14:28:24 +02:00
Claudio Atzori a235d2a24a Merge pull request 'Updates to steps related to transfer data to impala cluster' (#295) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#295
2023-05-18 08:46:15 +02:00
dimitrispie 86f4f63daf Updates to steps related to transfer data to impala cluster
1. Remove external table definitions in stats_ext
2. Fix the issue where some views are not created.
3. Added two workflow parameters for copying also the usage stats dbs
2023-05-18 09:33:05 +03:00
Claudio Atzori 909729a2fc [dedup] tweaking num partitions, minor changes 2023-05-17 10:16:22 +02:00
Ilias Kanellos 38020e242a Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-05-16 17:34:53 +03:00
Ilias Kanellos 3d69f33c84 Fix selection of columns in graph creation 2023-05-16 17:34:42 +03:00
Ilias Kanellos 3c38f7ba6f Fix selection of columns in graph creation 2023-05-16 17:32:53 +03:00
Serafeim Chatzopoulos 8ef718c363 Fix workflow application path 2023-05-16 16:28:48 +03:00
Serafeim Chatzopoulos 26328e2a0d Move job.properties 2023-05-16 14:39:53 +03:00
Serafeim Chatzopoulos 4eec3e7052 Add jobTracker, nameNode && spark2Lib as global params in oozie wf 2023-05-15 22:28:48 +03:00
Serafeim Chatzopoulos b83135c252 Add missing kill nodes in workflow.xml 2023-05-15 19:55:35 +03:00
Serafeim Chatzopoulos 45f2aa0867 Move end node ... at the end in workflow.xml 2023-05-15 17:52:20 +03:00
Claudio Atzori 8acad52a0c Merge branch 'beta' into apc_affiliation 2023-05-15 15:47:33 +02:00
Claudio Atzori 8a463cc3e8 fixed organization id created when mapping APC affiliations. Factored out ROR constants in dhp-common 2023-05-15 15:44:46 +02:00
Serafeim Chatzopoulos 12a57e1f58 Resolve conflicts 2023-05-15 16:20:11 +03:00
Serafeim Chatzopoulos 82e2a96f51 Resolve conflicts 2023-05-15 15:53:12 +03:00
Serafeim Chatzopoulos b8e8c959fe Update workflow.xml && job.properties 2023-05-15 15:50:23 +03:00
Ilias Kanellos 4a905932a3 Spark properties from job.properties 2023-05-15 15:24:22 +03:00
Claudio Atzori 0c314d5e09 Merge pull request 'Update copyDataToImpalaCluster.sh' (#293) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#293
2023-05-15 12:05:54 +02:00
Serafeim Chatzopoulos 07818131ef Update documentation 2023-05-15 13:04:44 +03:00
dimitrispie b3f9633205 Update copyDataToImpalaCluster.sh
Added option --user to impala-shell command
2023-05-15 12:51:44 +03:00
Miriam Baglioni 78b07400c0 changed test classes 2023-05-15 11:37:08 +02:00
Miriam Baglioni 86fe886c1a removed the inverse of the Citing relation 2023-05-15 11:20:51 +02:00
Ilias Kanellos 1788ac2d4d Correct filtering for MAG records 2023-05-12 12:55:43 +03:00
Miriam Baglioni 12cd179d2d Merge pull request 'Update copyDataToImpalaCluster.sh' (#291) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#291
2023-05-12 11:36:34 +02:00
dimitrispie 00d0d162b6 Update copyDataToImpalaCluster.sh
Added a temporary folder to copy the files to avoid permission issues
2023-05-12 12:31:13 +03:00
Ilias Kanellos 5ddbb4ad10 Spark properties no longer hardcoded 2023-05-11 15:36:47 +03:00
Ilias Kanellos 3de35fd6a3 Produce 5 classes of ranking scores 2023-05-11 14:42:25 +03:00
Miriam Baglioni 8c05f49665 moved the version as it was before the change 2023-05-09 10:48:34 +02:00