Michele Artini
35e6e9c064
tests
2020-07-28 12:02:15 +02:00
Claudio Atzori
ee832f358e
Merge pull request 'stats_wf_extensions_and_corrections' ( #28 ) from spyros/dnet-hadoop:stats_wf_extensions_and_corrections into master
Thank you guys! The updated workflow will be made available to the beta & production orchestration systems under the HDFS path
```/lib/dnet/oa/graph/stats/oozie_app```
2020-07-27 16:02:03 +02:00
Antonis Lempesis
4ac8ebe427
correctly calculating the project duration
2020-07-24 19:50:40 +03:00
Antonis Lempesis
18d9464b52
creating shadow db only if it does not exist...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
e217d496ab
added the dest db...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
b16bb68b9f
added the target db name...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
1ee7eeedf3
added the source db name...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
cecbbfa0fc
added missing tables and views: contexts, creation_date, funder
2020-07-24 19:50:40 +03:00
Antonis Lempesis
25b7a615f5
moved datasource_sources table creation into the datasource section
2020-07-24 19:50:40 +03:00
Antonis Lempesis
a8da4ab9c0
years in projects are now integers
2020-07-24 19:50:40 +03:00
Antonis Lempesis
c9cfc165d9
not using impala since the resulting tables are not visible
2020-07-24 19:50:40 +03:00
Antonis Lempesis
dd3d6a6e15
compute stats for the used and new impala tables
2020-07-24 19:50:40 +03:00
Antonis Lempesis
e6f50de6ef
Separated impala from hive steps
2020-07-24 19:50:40 +03:00
Antonis Lempesis
de49173420
fixed a typo in queries
2020-07-24 19:50:40 +03:00
antleb
391cf80fb8
Added peer-reviewed, green, gold tables and fields in result. Added shortcuts from result-country
2020-07-24 19:50:40 +03:00
antleb
68389d0125
Corrected the script used by the last step of the wf
2020-07-24 19:50:40 +03:00
antleb
ec52141f1a
changed refereed type from value to classname
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
63cd797aba
Comment out step 15 to make it work with the new schema of Claudio
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
138c6ddffa
Insert statement to datasource table that takes into account the piwik_id of the openAIRE graph
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
3630794cef
Fix to consider the relationships that have been 'virtually deleted' for project_results - defect #5607
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
5546f29e63
Corrections on the shadow schema and the impala table stats calculation
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
adf8a025d2
Adding more relations (Sources, Licences, Additional) and shadow schema as provided and discussed with Antonis Lempesis
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
657a40536b
Corrections by Spyros: Script cleanup, corrections and re-arrangement
2020-07-24 19:50:40 +03:00
Giorgos Alexiou
477fa6234d
Script re-organisation and adding table invalidations needed for impala
2020-07-24 19:50:40 +03:00
Miriam Baglioni
6c2223d1fc
added code to get the openaire id for contexts
2020-07-24 17:30:15 +02:00
Miriam Baglioni
afd54c1684
removed unneeded upload and refactored
2020-07-24 17:28:56 +02:00
Miriam Baglioni
7b0569d989
changed to also map the results associated with the whole graph
2020-07-24 17:28:11 +02:00
Miriam Baglioni
082225ad61
-
2020-07-24 17:27:26 +02:00
Miriam Baglioni
968c59d97a
added the logic to also dump the products for the whole graph. They will lack the collected from and context information, which will be materialized as new relations
2020-07-24 17:25:19 +02:00
Miriam Baglioni
332258d199
split the classes related to the communities dump and to the whole graph dump
2020-07-24 17:21:48 +02:00
Claudio Atzori
56bbfdc65d
introduced parameter 'numPartitions', driving the hive DB table data partitioning. Currently specified only for table 'project'
2020-07-23 08:54:10 +02:00
Sandro La Bruzzo
9ab594ccf6
fixed test
2020-07-21 10:36:21 +02:00
Claudio Atzori
ebf60020ac
map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type
2020-07-20 19:01:10 +02:00
Miriam Baglioni
355d7e426e
added dump for project - not finished
2020-07-20 18:54:43 +02:00
Miriam Baglioni
a2f01e5259
added getter and setter
2020-07-20 18:54:17 +02:00
Miriam Baglioni
40bbe94f7c
merge with master fork
2020-07-20 18:10:03 +02:00
Miriam Baglioni
2a15494b16
merge upstream
2020-07-20 18:05:01 +02:00
Miriam Baglioni
23160b4d29
realignment of the workflow classes with the changes in the structure of the module
2020-07-20 18:04:30 +02:00
Miriam Baglioni
b904e0699a
-
2020-07-20 18:02:53 +02:00
Miriam Baglioni
3aab7680f6
changed the test results
2020-07-20 18:00:43 +02:00
Miriam Baglioni
cde0300801
moved from projects to project
2020-07-20 17:57:35 +02:00
Miriam Baglioni
5076e4f320
changed test to comply with the modifications
2020-07-20 17:55:18 +02:00
Miriam Baglioni
08dbd99455
changed to dump the whole results graph by using classes already implemented for communities. Added class to also dump organizations
2020-07-20 17:54:28 +02:00
Miriam Baglioni
e47ea9349c
extended some types by adding provenance as the pair (provenance, trust) and moved some classes so they can also be used by the complete graph dump
2020-07-20 17:46:27 +02:00
Claudio Atzori
32f5e466e3
imports cleanup
2020-07-20 17:42:58 +02:00
Claudio Atzori
54ac583923
code formatting
2020-07-20 17:37:08 +02:00
Claudio Atzori
124e7ce19c
in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available
2020-07-20 17:33:37 +02:00
Claudio Atzori
050dda223d
Merge pull request 'removed duplicated fields' ( #25 ) from unique_field_in_lists into master
Looks good as a temporary workaround. I agree the model could seamlessly make the distinct operation by using HashSets instead of Linked (or Array) Lists.
The task to update the model in such a way is added on #9#issuecomment-1583
Thanks!
2020-07-20 12:12:50 +02:00
Claudio Atzori
e0c4cf6f7b
added parameter to drive the graph merge strategy: priority (BETA|PROD)
2020-07-20 10:48:01 +02:00
Claudio Atzori
94ccdb4852
Merge branch 'master' into merge_graph
2020-07-20 10:14:55 +02:00
Claudio Atzori
0937c9998f
Merge branch 'deduptesting'
2020-07-20 10:00:20 +02:00
Claudio Atzori
de72b1c859
cleanup
2020-07-20 09:59:11 +02:00
Michele Artini
331a3cbdd0
fixed originalId
2020-07-20 09:50:29 +02:00
Michele Artini
c59c5369b1
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-18 09:40:54 +02:00
Michele Artini
346a1d2b5a
update eventId generator
2020-07-18 09:40:36 +02:00
Sandro La Bruzzo
9116d75b3e
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-17 18:01:30 +02:00
Miriam Baglioni
d7d84c8217
-
2020-07-17 14:03:23 +02:00
Miriam Baglioni
47c7122773
changed priority from beta to production
2020-07-17 12:56:35 +02:00
Michele Artini
442f30930c
removed duplicated fields
2020-07-17 12:25:36 +02:00
Claudio Atzori
1781609508
code formatting
2020-07-16 19:06:56 +02:00
Claudio Atzori
db8b90a156
renamed CORE -> BETA
2020-07-16 19:05:13 +02:00
Miriam Baglioni
44e1c40c42
merge upstream
2020-07-16 18:49:38 +02:00
Claudio Atzori
878f2b931c
Merge branch 'master' into merge_graph
2020-07-16 16:34:24 +02:00
Claudio Atzori
cc5d13da85
introduced parameter shouldIndex (true|false)
2020-07-16 13:46:39 +02:00
Claudio Atzori
b098cc3cbe
avoid repeating identical values for fields: source, description
2020-07-16 13:45:53 +02:00
Claudio Atzori
805de4eca1
fix: filter the blocks with size = 1
2020-07-16 10:11:32 +02:00
Claudio Atzori
4b9fb2ffb8
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-15 11:26:04 +02:00
Claudio Atzori
b90389bac4
code formatting
2020-07-15 11:24:48 +02:00
Claudio Atzori
4e6f46e8fa
filter blocks with one record only
2020-07-15 11:22:20 +02:00
Michele Artini
262c29463e
relations with multiple datasources
2020-07-15 09:18:40 +02:00
Claudio Atzori
7d6e269b40
reverted CreateRelatedEntitiesJob_phase1 to its previous state
2020-07-13 22:54:04 +02:00
Claudio Atzori
8e97598eb4
avoid NPE in case of null instances
2020-07-13 20:46:14 +02:00
Claudio Atzori
06def0c0cb
SparkBlockStats allows repartitioning the input RDD via the numPartitions workflow parameter
2020-07-13 20:09:06 +02:00
miconis
b52c246aed
merge done
2020-07-13 19:57:02 +02:00
miconis
b8a45041fd
minor changes
2020-07-13 19:53:18 +02:00
Claudio Atzori
66f9f6d323
adjusted parameters for the dedup stats workflow
2020-07-13 19:26:46 +02:00
miconis
03ecfa5ebd
implementation of the test class for the new block stats spark action
2020-07-13 18:48:23 +02:00
miconis
10e08ccf45
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-13 18:22:45 +02:00
miconis
9258e4f095
implementation of a new workflow to compute statistics on the blocks
2020-07-13 18:22:34 +02:00
Claudio Atzori
c6f6fb0f28
code formatting
2020-07-13 16:46:13 +02:00
Claudio Atzori
8d2102d7d2
Merge branch 'deduptesting'
2020-07-13 16:32:43 +02:00
Claudio Atzori
344a90c2e6
updated assertions in propagateRelationTest
2020-07-13 16:32:04 +02:00
Claudio Atzori
1143f426aa
WIP SparkCreateMergeRels distinct relations
2020-07-13 16:13:36 +02:00
Claudio Atzori
8c67938ad0
configurable number of partitions used in the SparkCreateSimRels phase
2020-07-13 16:07:07 +02:00
Claudio Atzori
c73168b18e
Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting
2020-07-13 15:54:58 +02:00
Claudio Atzori
c8284bab06
WIP SparkCreateMergeRels distinct relations
2020-07-13 15:54:51 +02:00
Sandro La Bruzzo
1d133b7fe6
update test
2020-07-13 15:52:41 +02:00
Michele Artini
3635d05061
poms
2020-07-13 15:52:23 +02:00
Claudio Atzori
7dd91edf43
parsing of optional parameter
2020-07-13 15:40:41 +02:00
Claudio Atzori
4c101a9d66
WIP SparkCreateMergeRels distinct relations
2020-07-13 15:31:38 +02:00
Claudio Atzori
8a612d861a
WIP SparkCreateMergeRels distinct relations
2020-07-13 15:30:57 +02:00
Sandro La Bruzzo
9ef2385022
implemented test for cut of connected component
2020-07-13 15:28:17 +02:00
Sandro La Bruzzo
d561b2dd21
implemented cut of connected component
2020-07-13 14:18:42 +02:00
Miriam Baglioni
8e0e090d7a
merge upstream
2020-07-13 12:46:55 +02:00
Claudio Atzori
e2093e42db
Merge branch 'master' into deduptesting
2020-07-13 10:57:49 +02:00
Michele Artini
2c4ed9a043
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-13 10:55:39 +02:00
Michele Artini
ccbe5c5658
fixed import of eu.dnetlib.dhp:dnet-openaire-broker-common
2020-07-13 10:55:27 +02:00
Claudio Atzori
7a3fd9f54c
dedup relation aggregator moved into dedicated class
2020-07-13 10:11:36 +02:00
Alessia Bardi
7e96105947
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-12 19:29:12 +02:00
Alessia Bardi
b7a39731a6
assert, not print
2020-07-12 19:28:56 +02:00
Miriam Baglioni
f9ad6f3255
Merge branch 'dump' of code-repo.d4science.org:miriam.baglioni/dnet-hadoop into dump
2020-07-10 19:42:53 +02:00
Miriam Baglioni
c27f12d6e8
avoid considering the _SUCCESS file
2020-07-10 19:42:23 +02:00
Claudio Atzori
770adc26e9
WIP aggregator to make relationships unique
2020-07-10 19:35:10 +02:00
Claudio Atzori
ecf119f37a
Merge branch 'master' into deduptesting
2020-07-10 19:04:16 +02:00
Claudio Atzori
31071e363f
Merge branch 'provision_indexing'
2020-07-10 19:03:57 +02:00
Claudio Atzori
06c1913062
added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations
2020-07-10 19:03:33 +02:00
Claudio Atzori
cc77446dc4
added dbSchema parameter to the raw_db workflow
2020-07-10 19:01:50 +02:00
Claudio Atzori
4c3836f62e
materialize the related entities before joining them
2020-07-10 19:00:44 +02:00
Michele Artini
e1ae964bc4
stats
2020-07-10 16:12:08 +02:00
Claudio Atzori
752d28f8eb
make the relations produced by the dedup SparkPropagateRelation job unique
2020-07-10 15:09:50 +02:00
Sandro La Bruzzo
c01efed79b
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-10 14:44:57 +02:00
Sandro La Bruzzo
a7d3977481
added generation of EBI Dataset
2020-07-10 14:44:50 +02:00
Claudio Atzori
b21866a2da
allow setting different relation cut points by source and by target; adjusted the weights assigned to relationship types
2020-07-10 13:59:48 +02:00
Claudio Atzori
ff4d6214f1
experimenting with pruning of relations
2020-07-10 10:06:41 +02:00
Miriam Baglioni
faea30cda0
-
2020-07-09 14:05:21 +02:00
Michele Artini
2d742a84ae
DedupConfig as json file
2020-07-09 12:53:46 +02:00
Miriam Baglioni
a634794242
merge upstream
2020-07-09 11:46:51 +02:00
Michele Artini
a44b9b36b9
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-09 11:02:31 +02:00
Michele Artini
1c6a171633
updated pom
2020-07-09 11:02:09 +02:00
Claudio Atzori
3c728aaa0c
trying to overcome OOM errors during duplicate scan phase
2020-07-08 22:39:51 +02:00
Claudio Atzori
18c555cd79
Merge branch 'master' into deduptesting
2020-07-08 22:32:01 +02:00
Claudio Atzori
4365cf41d7
trying to overcome OOM errors during duplicate scan phase
2020-07-08 22:31:46 +02:00
Claudio Atzori
67e1d222b6
bulk cleaning: when found null or empty, sets bestaccessrights by evaluating the result instances
2020-07-08 17:53:35 +02:00
Alessia Bardi
853e8d7987
test for software merge
2020-07-08 17:03:53 +02:00
Claudio Atzori
610d377d57
first implementation of the BETA & PROD graphs merge procedure
2020-07-08 16:54:26 +02:00
Alessia Bardi
9a898c0e4c
Json schema generator
2020-07-08 12:52:00 +02:00
Alessia Bardi
636f9ce7d6
json schema generator lib
2020-07-08 12:50:57 +02:00
Alessia Bardi
8f83b726fa
Dump json schema compliant to json schema Draft 7
2020-07-08 12:48:46 +02:00
Claudio Atzori
e2ea30f89d
updated graph construction workflow definition: cleaning wf moved at the bottom to include cleaning of the information produced by the enrichment workflows
2020-07-08 12:16:24 +02:00
Miriam Baglioni
1b0b968548
fixed issue on substring
2020-07-08 12:11:51 +02:00
Miriam Baglioni
7fe00cb4fb
-
2020-07-08 10:29:37 +02:00
Miriam Baglioni
375ef07d7b
changed the description for the upload
2020-07-07 18:41:27 +02:00
Miriam Baglioni
35c8265793
added the json extension to filename
2020-07-07 18:29:49 +02:00
Miriam Baglioni
81434f8e5e
added method newInstance
2020-07-07 18:26:10 +02:00
Miriam Baglioni
817cddfc52
-
2020-07-07 18:25:12 +02:00
Miriam Baglioni
a66aa9bd83
removed unneeded tests
2020-07-07 18:25:00 +02:00
Miriam Baglioni
9b20a21b24
removed unneeded tests
2020-07-07 18:23:37 +02:00
Miriam Baglioni
8a1b42ff21
added check to verify that dump contains at least one product
2020-07-07 18:21:35 +02:00
Miriam Baglioni
d86adb82a7
-
2020-07-07 18:20:51 +02:00
Miriam Baglioni
b2782025f6
enabled the whole workflow to run. Added property to give priority to dependencies in the classpath, to resolve conflicts
2020-07-07 18:10:47 +02:00
Miriam Baglioni
83d2c84b77
added constraints to the xquery so as to get only profiles with status manager or all
2020-07-07 18:09:48 +02:00
Miriam Baglioni
4c8d86493c
-
2020-07-07 18:09:06 +02:00
Miriam Baglioni
0208bc18f3
added new resource for testing
2020-07-07 17:47:24 +02:00
Miriam Baglioni
f5bb65c9ef
the json schema for the dump of the results
2020-07-07 17:34:40 +02:00
Michele Artini
dffa0b01a2
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-07 15:37:29 +02:00
Michele Artini
efadbdb2bc
fixed a bug with duplicated events
2020-07-07 15:37:13 +02:00
Claudio Atzori
8af8e7481a
code formatting
2020-07-07 14:23:34 +02:00
Claudio Atzori
b383ed42fa
pass optional parameter relationFilter to the PrepareRelationJob implementation
2020-07-07 14:21:28 +02:00
Claudio Atzori
911894a987
Merge branch 'deduptesting'
2020-07-07 14:20:43 +02:00
Miriam Baglioni
c19818a3f8
merge branch with fork master
2020-07-06 13:58:23 +02:00