1388 Commits

Author SHA1 Message Date
Antonis Lempesis dd3d6a6e15 compute stats for the used and new impala tables 2020-07-24 19:50:40 +03:00
Antonis Lempesis e6f50de6ef Separated impala from hive steps 2020-07-24 19:50:40 +03:00
Antonis Lempesis de49173420 fixed a typo in queries 2020-07-24 19:50:40 +03:00
antleb 391cf80fb8 Added peer-reviewed, green, gold tables and fields in result. Added shortcuts from result-country 2020-07-24 19:50:40 +03:00
antleb 68389d0125 Corrected the script used by the last step of the wf 2020-07-24 19:50:40 +03:00
antleb ec52141f1a changed refereed type from value to classname 2020-07-24 19:50:40 +03:00
Spyros Zoupanos 63cd797aba Comment out step 15 to make it work with the new schema of Claudio 2020-07-24 19:50:40 +03:00
Spyros Zoupanos 138c6ddffa Insert statement to datasource table that takes into account the piwik_id of the openAIRE graph 2020-07-24 19:50:40 +03:00
Spyros Zoupanos 3630794cef Fix to consider the relationships that have been 'virtually deleted' for project_results - defect #5607 2020-07-24 19:50:40 +03:00
Spyros Zoupanos 5546f29e63 Corrections on the shadow schema and the impala table stats calculation 2020-07-24 19:50:40 +03:00
Spyros Zoupanos adf8a025d2 Adding more relations (Sources, Licences, Additional) and shadow schema as provided and discussed with Antonis Lempesis 2020-07-24 19:50:40 +03:00
Spyros Zoupanos 657a40536b Corrections by Spyros: Script cleanup, corrections and re-arrangement 2020-07-24 19:50:40 +03:00
Giorgos Alexiou 477fa6234d Script re-organisation and adding table invalidations needed for impala 2020-07-24 19:50:40 +03:00
Claudio Atzori 56bbfdc65d introduced parameter 'numParitions', driving the hive DB table data partitioning. Currently specified only for table 'project' 2020-07-23 08:54:10 +02:00
Sandro La Bruzzo 9ab594ccf6 fixed test 2020-07-21 10:36:21 +02:00
Claudio Atzori ebf60020ac map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type 2020-07-20 19:01:10 +02:00
Claudio Atzori 32f5e466e3 imports cleanup 2020-07-20 17:42:58 +02:00
Claudio Atzori 54ac583923 code formatting 2020-07-20 17:37:08 +02:00
Claudio Atzori 124e7ce19c in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available 2020-07-20 17:33:37 +02:00
Claudio Atzori 050dda223d Merge pull request 'removed duplicated fields' (#25) from unique_field_in_lists into master
Looks good as a temporary workaround. I agree the model could seamlessly make the distinct operation by using HashSets instead of Linked (or Array) Lists.

The task to update the model in such a way is added on #9#issuecomment-1583

Thanks!
2020-07-20 12:12:50 +02:00
Claudio Atzori e0c4cf6f7b added parameter to drive the graph merge strategy: priority (BETA|PROD) 2020-07-20 10:48:01 +02:00
Claudio Atzori 94ccdb4852 Merge branch 'master' into merge_graph 2020-07-20 10:14:55 +02:00
Claudio Atzori 0937c9998f Merge branch 'deduptesting' 2020-07-20 10:00:20 +02:00
Claudio Atzori 105176105c updated dnet-pace-core dependency to version 4.0.4 to include the latest clustering function 2020-07-20 09:59:47 +02:00
Claudio Atzori de72b1c859 cleanup 2020-07-20 09:59:11 +02:00
Michele Artini 331a3cbdd0 fixed originalId 2020-07-20 09:50:29 +02:00
Michele Artini c59c5369b1 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-18 09:40:54 +02:00
Michele Artini 346a1d2b5a update eventId generator 2020-07-18 09:40:36 +02:00
Sandro La Bruzzo 9116d75b3e Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-17 18:01:30 +02:00
Miriam Baglioni 47c7122773 changed priority from beta to production 2020-07-17 12:56:35 +02:00
Michele Artini 442f30930c removed duplicated fields 2020-07-17 12:25:36 +02:00
Claudio Atzori 1781609508 code formatting 2020-07-16 19:06:56 +02:00
Claudio Atzori db8b90a156 renamed CORE -> BETA 2020-07-16 19:05:13 +02:00
Claudio Atzori 878f2b931c Merge branch 'master' into merge_graph 2020-07-16 16:34:24 +02:00
Claudio Atzori cc5d13da85 introduced parameter shouldIndex (true|false) 2020-07-16 13:46:39 +02:00
Claudio Atzori b098cc3cbe avoid repeating identical values for fields: source, description 2020-07-16 13:45:53 +02:00
Claudio Atzori 805de4eca1 fix: filter the blocks with size = 1 2020-07-16 10:11:32 +02:00
Claudio Atzori 4b9fb2ffb8 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-07-15 11:26:04 +02:00
Claudio Atzori 5033c25587 code formatting 2020-07-15 11:26:00 +02:00
Claudio Atzori b90389bac4 code formatting 2020-07-15 11:24:48 +02:00
Claudio Atzori 4e6f46e8fa filter blocks with one record only 2020-07-15 11:22:20 +02:00
Michele Artini 262c29463e relations with multiple datasources 2020-07-15 09:18:40 +02:00
Claudio Atzori 7d6e269b40 reverted CreateRelatedEntitiesJob_phase1 to its previous state 2020-07-13 22:54:04 +02:00
Claudio Atzori 8e97598eb4 avoid NPE in case of null instances 2020-07-13 20:46:14 +02:00
Claudio Atzori 06def0c0cb SparkBlockStats allows to repartition the input rdd via the numPartitions workflow parameter 2020-07-13 20:09:06 +02:00
miconis b52c246aed merge done 2020-07-13 19:57:02 +02:00
miconis b8a45041fd minor changes 2020-07-13 19:53:18 +02:00
Claudio Atzori 66f9f6d323 adjusted parameters for the dedup stats workflow 2020-07-13 19:26:46 +02:00
miconis 03ecfa5ebd implementation of the test class for the new block stats spark action 2020-07-13 18:48:23 +02:00
miconis 10e08ccf45 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-13 18:22:45 +02:00