Alessia Bardi
c35bf486cc
added handle among the possible PIDs
2020-08-04 12:50:12 +02:00
Miriam Baglioni
5b651abf82
merge branch with master
2020-08-04 10:14:07 +02:00
Miriam Baglioni
88e4c3b751
added default trust to context bulktagged
2020-08-04 10:13:25 +02:00
Miriam Baglioni
f9342cb484
added constant
2020-08-03 18:32:35 +02:00
Miriam Baglioni
96c3c891f4
added trust
2020-08-03 18:32:17 +02:00
Miriam Baglioni
53656600ad
changed XQuery to select only community and ri with status not hidden
2020-08-03 18:29:30 +02:00
Miriam Baglioni
b34177d8ef
merge upstream
2020-08-03 18:13:42 +02:00
Miriam Baglioni
901ae37f7b
added step to workflow
2020-08-03 18:12:54 +02:00
Miriam Baglioni
fa38cdb10b
added resource
2020-08-03 18:11:12 +02:00
Miriam Baglioni
e9fcc0b2f1
commented out unit test - changes to mirror the changed logic still to be decided
2020-08-03 18:10:53 +02:00
Miriam Baglioni
e43aeb139a
added new property file and changed some parameters in old files
2020-08-03 18:07:28 +02:00
Miriam Baglioni
aa9f3d9698
changed logic to save directly in s3
2020-08-03 18:06:18 +02:00
Miriam Baglioni
d465f0eec9
added fulltext to result
2020-08-03 18:03:27 +02:00
Miriam Baglioni
ec4b392d12
added new dependencies for writing on s3
2020-08-03 17:57:04 +02:00
Miriam Baglioni
c892c7dfa7
changed to query for community map just once and save the result for remaining executions
2020-08-03 17:56:31 +02:00
Claudio Atzori
3a11a387a9
data provision workflow enhancement: added nodes to perform DELETE BY QUERY before the indexing begins and COMMIT after the indexing is completed
2020-08-03 14:28:08 +02:00
Alessia Bardi
8cc067fe76
specific test for claims
2020-08-03 11:17:50 +02:00
Claudio Atzori
a89b6cc3ba
Merge pull request 'nsprefix_blacklist' (#34) from nsprefix_blacklist into master
2020-07-31 11:52:23 +02:00
Sandro La Bruzzo
0c3bc9ea4b
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-31 09:07:18 +02:00
Sandro La Bruzzo
168bfb496a
adopted dedup to the new schema
2020-07-31 09:06:57 +02:00
Michele Artini
652b13abb6
Merge branch 'master' into nsprefix_blacklist
2020-07-31 07:58:37 +02:00
Claudio Atzori
cd631bb5bc
defaults fixed in the cleaning workflow: forces result.publisher to NULL when result.publisher.value is empty
2020-07-30 17:03:53 +02:00
Miriam Baglioni
872d7783fc
-
2020-07-30 16:50:36 +02:00
Miriam Baglioni
57c87b7653
re-implemented to fix issue on not serializable Set<String> variable
2020-07-30 16:43:43 +02:00
Miriam Baglioni
ef8e5957b5
added specific directory where to save results
2020-07-30 16:42:46 +02:00
Miriam Baglioni
75f3361c85
-
2020-07-30 16:41:31 +02:00
Miriam Baglioni
3f695b25fa
refactoring
2020-07-30 16:40:15 +02:00
Miriam Baglioni
e623f12bef
refactoring
2020-07-30 16:32:59 +02:00
Miriam Baglioni
ff7d05abb4
added support class to store the couple organizationId representativeId got from sql query on hive
2020-07-30 16:32:04 +02:00
Miriam Baglioni
cf6d80b2ab
added command to close the writer
2020-07-30 16:31:22 +02:00
Miriam Baglioni
f985bca37b
added USER_CLAIM constant value
2020-07-30 16:25:26 +02:00
Claudio Atzori
4bbfcf1ac6
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-30 16:25:06 +02:00
Claudio Atzori
4ff8007518
added function to set the missing vocabulary names, used in the cleaning workflow as a pre-cleaning step
2020-07-30 16:24:39 +02:00
Miriam Baglioni
6f1c40a933
-
2020-07-30 16:24:28 +02:00
Miriam Baglioni
2b66a93f9e
added property file that was missing
2020-07-30 16:24:17 +02:00
Michele Artini
bdece15ca0
blacklist of nsprefix
2020-07-30 16:13:38 +02:00
Sandro La Bruzzo
c97c8f0c44
implemented new oozie job to extract entities in a separate dataset
2020-07-30 12:13:58 +02:00
Sandro La Bruzzo
3010a362bc
updated the provision workflow in the aggregation phase. Removed serialization in JSON RDD and used spark Dataset
2020-07-30 09:25:56 +02:00
Sandro La Bruzzo
487226f669
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-30 09:25:39 +02:00
Sandro La Bruzzo
16ae3c9ccf
updated the provision workflow in the aggregation phase. Removed serialization in JSON RDD and used spark Dataset
2020-07-30 09:25:32 +02:00
Miriam Baglioni
ee8420c6b3
added resource for datasource test
2020-07-29 18:28:43 +02:00
Miriam Baglioni
76bcab98ce
added code to filter out null originalId from the dump
2020-07-29 18:28:21 +02:00
Miriam Baglioni
ef1d8aef17
added one test to verify the dump for the datasources
2020-07-29 18:27:46 +02:00
Miriam Baglioni
86bab79512
-
2020-07-29 18:20:22 +02:00
Miriam Baglioni
31791dcf3d
fixed wrong property file path name
2020-07-29 18:20:08 +02:00
Miriam Baglioni
9e722aa1ef
-
2020-07-29 18:00:08 +02:00
Miriam Baglioni
d22f106f27
added constant to identify datasource associated to funders
2020-07-29 17:56:55 +02:00
Miriam Baglioni
40e194fe2f
added check to not dump datasources related to funders
2020-07-29 17:56:18 +02:00
Miriam Baglioni
b48934f6df
changed the workflow name
2020-07-29 17:43:43 +02:00
Miriam Baglioni
1433db825d
refactoring
2020-07-29 17:43:24 +02:00
Miriam Baglioni
074e9ab75e
refactoring
2020-07-29 17:42:50 +02:00
Miriam Baglioni
8ad8dac7d4
merge branch with fork master
2020-07-29 17:38:28 +02:00
Miriam Baglioni
9e997e63a2
merge upstream
2020-07-29 17:38:14 +02:00
Miriam Baglioni
9fa82dc93b
fixed issue
2020-07-29 17:36:16 +02:00
Miriam Baglioni
8907648d6a
-
2020-07-29 17:35:47 +02:00
Miriam Baglioni
536e7f6352
added and changed resources for testing of the whole graph dump and of community related products dumps
2020-07-29 17:33:34 +02:00
Miriam Baglioni
4d7f590493
testings for the whole graph dump
2020-07-29 17:32:37 +02:00
Miriam Baglioni
a2f73ec2c7
changed due to changes in the model
2020-07-29 17:32:02 +02:00
Miriam Baglioni
481585e9d3
-
2020-07-29 17:31:41 +02:00
Miriam Baglioni
40a8dafbdc
-
2020-07-29 17:30:44 +02:00
Miriam Baglioni
de2ebb467e
changed due to changes in the model
2020-07-29 17:08:02 +02:00
Miriam Baglioni
d0ff2a56fb
-
2020-07-29 17:06:53 +02:00
Miriam Baglioni
b96dedb56b
changed due to changes in the model
2020-07-29 17:05:31 +02:00
Miriam Baglioni
6d0f08277b
classes to implement the dump of the whole graph.
2020-07-29 17:03:19 +02:00
Miriam Baglioni
8d4327b292
input parameters and workflow definition for the dump of the whole graph
2020-07-29 17:00:34 +02:00
Miriam Baglioni
b5f995ab12
refactoring
2020-07-29 16:59:48 +02:00
Miriam Baglioni
f7a87cc447
added new constants value
2020-07-29 16:58:40 +02:00
Miriam Baglioni
b71d12cf26
refactoring
2020-07-29 16:52:44 +02:00
Miriam Baglioni
a8d65b68cb
changed to delete the part to check if it was a test or a real execution
2020-07-29 16:47:57 +02:00
Miriam Baglioni
3ec2392904
Added new class to move the place where the split is actually run
2020-07-29 16:46:50 +02:00
Michele Artini
8ba94833bd
added an es prop
2020-07-29 14:16:08 +02:00
Miriam Baglioni
178c2729a7
changed the path to reach the java class to be executed
2020-07-29 12:29:51 +02:00
Miriam Baglioni
437ac12139
removed unused parameter
2020-07-29 12:28:16 +02:00
Claudio Atzori
6f11c0496e
fixed typo in module name dhp-worfklow-profiles -> dhp-workflow-profiles
2020-07-28 15:01:58 +02:00
Claudio Atzori
f680eb3e12
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-28 14:10:56 +02:00
Claudio Atzori
985b360c31
fixed typo in module name dhp-worfklow-profiles -> dhp-workflow-profiles
2020-07-28 14:10:52 +02:00
Michele Artini
3acd632123
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-28 12:02:30 +02:00
Michele Artini
35e6e9c064
tests
2020-07-28 12:02:15 +02:00
Claudio Atzori
ee832f358e
Merge pull request 'stats_wf_extensions_and_corrections' (#28) from spyros/dnet-hadoop:stats_wf_extensions_and_corrections into master
...
Thank you Guys! The updated workflow will be made available to the beta & production orchestration systems under the HDFS path
```/lib/dnet/oa/graph/stats/oozie_app```
2020-07-27 16:02:03 +02:00
Antonis Lempesis
4ac8ebe427
correctly calculating the project duration
2020-07-24 19:50:40 +03:00
Antonis Lempesis
18d9464b52
creating shadow db only if it does not exist...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
e217d496ab
added the dest db...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
b16bb68b9f
added the target db name...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
1ee7eeedf3
added the source db name...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
cecbbfa0fc
added missing tables and views: contexts, creation_date, funder
2020-07-24 19:50:40 +03:00
Antonis Lempesis
25b7a615f5
moved datasource_sources table creation to the datasource section
2020-07-24 19:50:40 +03:00
Antonis Lempesis
a8da4ab9c0
years in projects are now integers
2020-07-24 19:50:40 +03:00
Antonis Lempesis
c9cfc165d9
not using impala since the resulting tables are not visible
2020-07-24 19:50:40 +03:00
Antonis Lempesis
dd3d6a6e15
compute stats for the used and new impala tables
2020-07-24 19:50:40 +03:00
Antonis Lempesis
e6f50de6ef
Separated impala from hive steps
2020-07-24 19:50:40 +03:00
Antonis Lempesis
de49173420
fixed a typo in queries
2020-07-24 19:50:40 +03:00
antleb
391cf80fb8
Added peer-reviewed, green, gold tables and fields in result. Added shortcuts from result-country
2020-07-24 19:50:40 +03:00
antleb
68389d0125
Corrected the script used by the last step of the wf
2020-07-24 19:50:40 +03:00
antleb
ec52141f1a
changed refereed type from value to classname
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
63cd797aba
Comment out step 15 to make it work with the new schema of Claudio
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
138c6ddffa
Insert statement to datasource table that takes into account the piwik_id of the openAIRE graph
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
3630794cef
Fix to consider the relationships that have been 'virtually deleted' for project_results - defect #5607
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
5546f29e63
Corrections on the shadow schema and the impala table stats calculation
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
adf8a025d2
Adding more relations (Sources, Licences, Additional) and shadow schema as provided and discussed with Antonis Lempesis
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
657a40536b
Corrections by Spyros: Script cleanup, corrections and re-arrangement
2020-07-24 19:50:40 +03:00
Giorgos Alexiou
477fa6234d
Script re-organisation and adding table invalidations needed for impala
2020-07-24 19:50:40 +03:00
Miriam Baglioni
6c2223d1fc
added code to get the openaire id for contexts
2020-07-24 17:30:15 +02:00
Miriam Baglioni
afd54c1684
removed not needed upload and refactoring
2020-07-24 17:28:56 +02:00
Miriam Baglioni
7b0569d989
changed to map also the result associated to the whole graph
2020-07-24 17:28:11 +02:00
Miriam Baglioni
082225ad61
-
2020-07-24 17:27:26 +02:00
Miriam Baglioni
968c59d97a
added the logic to dump also the products for the whole graph. They will miss collected from and context information that will be materialized as new relations
2020-07-24 17:25:19 +02:00
Miriam Baglioni
332258d199
split the classes related to the communities dump and to the whole graph dump
2020-07-24 17:21:48 +02:00
Claudio Atzori
56bbfdc65d
introduced parameter 'numPartitions', driving the hive DB table data partitioning. Currently specified only for table 'project'
2020-07-23 08:54:10 +02:00
Sandro La Bruzzo
9ab594ccf6
fixed test
2020-07-21 10:36:21 +02:00
Claudio Atzori
ebf60020ac
map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type
2020-07-20 19:01:10 +02:00
Miriam Baglioni
355d7e426e
added dump for project - not finished
2020-07-20 18:54:43 +02:00
Miriam Baglioni
a2f01e5259
added getter and setter
2020-07-20 18:54:17 +02:00
Miriam Baglioni
40bbe94f7c
merge with master fork
2020-07-20 18:10:03 +02:00
Miriam Baglioni
2a15494b16
merge upstream
2020-07-20 18:05:01 +02:00
Miriam Baglioni
23160b4d29
realignment of the workflow classes with the changes in the structure of the module
2020-07-20 18:04:30 +02:00
Miriam Baglioni
b904e0699a
-
2020-07-20 18:02:53 +02:00
Miriam Baglioni
3aab7680f6
changed the test results
2020-07-20 18:00:43 +02:00
Miriam Baglioni
cde0300801
moved from projects to project
2020-07-20 17:57:35 +02:00
Miriam Baglioni
5076e4f320
changed test to comply with the modifications
2020-07-20 17:55:18 +02:00
Miriam Baglioni
08dbd99455
changed to dump the whole results graph by using classes already implemented for communities. Added class to dump also organization
2020-07-20 17:54:28 +02:00
Miriam Baglioni
e47ea9349c
extended some types by adding provenance as the couple (provenance, trust) and moved some classes to be used by the complete graph dump also
2020-07-20 17:46:27 +02:00
Claudio Atzori
32f5e466e3
imports cleanup
2020-07-20 17:42:58 +02:00
Claudio Atzori
54ac583923
code formatting
2020-07-20 17:37:08 +02:00
Claudio Atzori
124e7ce19c
in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available
2020-07-20 17:33:37 +02:00
Claudio Atzori
050dda223d
Merge pull request 'removed duplicated fields' (#25) from unique_field_in_lists into master
...
Looks good as a temporary workaround. I agree the model could seamlessly make the distinct operation by using HashSets instead of Linked (or Array) Lists.
The task to update the model in such a way is added on #9#issuecomment-1583
Thanks!
2020-07-20 12:12:50 +02:00
Claudio Atzori
e0c4cf6f7b
added parameter to drive the graph merge strategy: priority (BETA|PROD)
2020-07-20 10:48:01 +02:00
Claudio Atzori
94ccdb4852
Merge branch 'master' into merge_graph
2020-07-20 10:14:55 +02:00
Claudio Atzori
0937c9998f
Merge branch 'deduptesting'
2020-07-20 10:00:20 +02:00
Claudio Atzori
de72b1c859
cleanup
2020-07-20 09:59:11 +02:00
Michele Artini
331a3cbdd0
fixed originalId
2020-07-20 09:50:29 +02:00
Michele Artini
c59c5369b1
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-18 09:40:54 +02:00
Michele Artini
346a1d2b5a
update eventId generator
2020-07-18 09:40:36 +02:00
Sandro La Bruzzo
9116d75b3e
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-17 18:01:30 +02:00
Miriam Baglioni
d7d84c8217
-
2020-07-17 14:03:23 +02:00
Miriam Baglioni
47c7122773
changed priority from beta to production
2020-07-17 12:56:35 +02:00
Michele Artini
442f30930c
removed duplicated fields
2020-07-17 12:25:36 +02:00
Claudio Atzori
1781609508
code formatting
2020-07-16 19:06:56 +02:00
Claudio Atzori
db8b90a156
renamed CORE -> BETA
2020-07-16 19:05:13 +02:00
Miriam Baglioni
44e1c40c42
merge upstream
2020-07-16 18:49:38 +02:00
Claudio Atzori
878f2b931c
Merge branch 'master' into merge_graph
2020-07-16 16:34:24 +02:00
Claudio Atzori
cc5d13da85
introduced parameter shouldIndex (true|false)
2020-07-16 13:46:39 +02:00
Claudio Atzori
b098cc3cbe
avoid repeating identical values for fields: source, description
2020-07-16 13:45:53 +02:00
Claudio Atzori
805de4eca1
fix: filter the blocks with size = 1
2020-07-16 10:11:32 +02:00
Claudio Atzori
4b9fb2ffb8
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-15 11:26:04 +02:00
Claudio Atzori
b90389bac4
code formatting
2020-07-15 11:24:48 +02:00
Claudio Atzori
4e6f46e8fa
filter blocks with one record only
2020-07-15 11:22:20 +02:00
Michele Artini
262c29463e
relations with multiple datasources
2020-07-15 09:18:40 +02:00
Claudio Atzori
7d6e269b40
reverted CreateRelatedEntitiesJob_phase1 to its previous state
2020-07-13 22:54:04 +02:00
Claudio Atzori
8e97598eb4
avoid NPE in case of null instances
2020-07-13 20:46:14 +02:00
Claudio Atzori
06def0c0cb
SparkBlockStats allows repartitioning the input rdd via the numPartitions workflow parameter
2020-07-13 20:09:06 +02:00