Spyros Zoupanos
63cd797aba
Comment out step 15 to make it work with Claudio's new schema
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
138c6ddffa
Insert statement to datasource table that takes into account the piwik_id of the openAIRE graph
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
3630794cef
Fix to consider the relationships that have been 'virtually deleted' for project_results - defect #5607
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
5546f29e63
Corrections on the shadow schema and the impala table stats calculation
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
adf8a025d2
Adding more relations (Sources, Licences, Additional) and shadow schema as provided and discussed with Antonis Lempesis
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
657a40536b
Corrections by Spyros: Script cleanup, corrections and re-arrangement
2020-07-24 19:50:40 +03:00
Giorgos Alexiou
477fa6234d
Script re-organisation and adding table invalidations needed for impala
2020-07-24 19:50:40 +03:00
Claudio Atzori
56bbfdc65d
introduced parameter 'numPartitions', driving the hive DB table data partitioning. Currently specified only for table 'project'
2020-07-23 08:54:10 +02:00
Sandro La Bruzzo
9ab594ccf6
fixed test
2020-07-21 10:36:21 +02:00
Claudio Atzori
ebf60020ac
map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type
2020-07-20 19:01:10 +02:00
Claudio Atzori
32f5e466e3
imports cleanup
2020-07-20 17:42:58 +02:00
Claudio Atzori
54ac583923
code formatting
2020-07-20 17:37:08 +02:00
Claudio Atzori
124e7ce19c
in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available
2020-07-20 17:33:37 +02:00
Claudio Atzori
050dda223d
Merge pull request 'removed duplicated fields' ( #25 ) from unique_field_in_lists into master
Looks good as a temporary workaround. I agree the model could seamlessly perform the distinct operation by using HashSets instead of Linked (or Array) Lists.
The task to update the model in such a way is added on #9#issuecomment-1583
Thanks!
2020-07-20 12:12:50 +02:00
Claudio Atzori
e0c4cf6f7b
added parameter to drive the graph merge strategy: priority (BETA|PROD)
2020-07-20 10:48:01 +02:00
Claudio Atzori
94ccdb4852
Merge branch 'master' into merge_graph
2020-07-20 10:14:55 +02:00
Claudio Atzori
0937c9998f
Merge branch 'deduptesting'
2020-07-20 10:00:20 +02:00
Claudio Atzori
105176105c
updated dnet-pace-core dependency to version 4.0.4 to include the latest clustering function
2020-07-20 09:59:47 +02:00
Claudio Atzori
de72b1c859
cleanup
2020-07-20 09:59:11 +02:00
Michele Artini
331a3cbdd0
fixed originalId
2020-07-20 09:50:29 +02:00
Michele Artini
c59c5369b1
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-18 09:40:54 +02:00
Michele Artini
346a1d2b5a
update eventId generator
2020-07-18 09:40:36 +02:00
Sandro La Bruzzo
9116d75b3e
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-17 18:01:30 +02:00
Miriam Baglioni
47c7122773
changed priority from beta to production
2020-07-17 12:56:35 +02:00
Michele Artini
442f30930c
removed duplicated fields
2020-07-17 12:25:36 +02:00
Claudio Atzori
1781609508
code formatting
2020-07-16 19:06:56 +02:00
Claudio Atzori
db8b90a156
renamed CORE -> BETA
2020-07-16 19:05:13 +02:00
Claudio Atzori
878f2b931c
Merge branch 'master' into merge_graph
2020-07-16 16:34:24 +02:00
Claudio Atzori
cc5d13da85
introduced parameter shouldIndex (true|false)
2020-07-16 13:46:39 +02:00
Claudio Atzori
b098cc3cbe
avoid repeating identical values for fields: source, description
2020-07-16 13:45:53 +02:00
Claudio Atzori
805de4eca1
fix: filter the blocks with size = 1
2020-07-16 10:11:32 +02:00
Claudio Atzori
4b9fb2ffb8
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-15 11:26:04 +02:00
Claudio Atzori
5033c25587
code formatting
2020-07-15 11:26:00 +02:00
Claudio Atzori
b90389bac4
code formatting
2020-07-15 11:24:48 +02:00
Claudio Atzori
4e6f46e8fa
filter blocks with one record only
2020-07-15 11:22:20 +02:00
Michele Artini
262c29463e
relations with multiple datasources
2020-07-15 09:18:40 +02:00
Claudio Atzori
7d6e269b40
reverted CreateRelatedEntitiesJob_phase1 to its previous state
2020-07-13 22:54:04 +02:00
Claudio Atzori
8e97598eb4
avoid NPE in case of null instances
2020-07-13 20:46:14 +02:00
Claudio Atzori
06def0c0cb
SparkBlockStats allows repartitioning the input RDD via the numPartitions workflow parameter
2020-07-13 20:09:06 +02:00
miconis
b52c246aed
merge done
2020-07-13 19:57:02 +02:00
miconis
b8a45041fd
minor changes
2020-07-13 19:53:18 +02:00
Claudio Atzori
66f9f6d323
adjusted parameters for the dedup stats workflow
2020-07-13 19:26:46 +02:00
miconis
03ecfa5ebd
implementation of the test class for the new block stats spark action
2020-07-13 18:48:23 +02:00
miconis
10e08ccf45
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-13 18:22:45 +02:00
miconis
9258e4f095
implementation of a new workflow to compute statistics on the blocks
2020-07-13 18:22:34 +02:00
Claudio Atzori
c6f6fb0f28
code formatting
2020-07-13 16:46:13 +02:00
Claudio Atzori
8d2102d7d2
Merge branch 'deduptesting'
2020-07-13 16:32:43 +02:00
Claudio Atzori
344a90c2e6
updated assertions in propagateRelationTest
2020-07-13 16:32:04 +02:00
Claudio Atzori
1143f426aa
WIP SparkCreateMergeRels distinct relations
2020-07-13 16:13:36 +02:00
Claudio Atzori
8c67938ad0
configurable number of partitions used in the SparkCreateSimRels phase
2020-07-13 16:07:07 +02:00