560 Commits

Author SHA1 Message Date
Claudio Atzori 0b55795d4d small adjustments in the provisioning workflow 2020-04-21 16:15:04 +02:00
Claudio Atzori 88fbb3a353 added sparkSqlWarehouseDir to the default extra spark options passed to each workflow 2020-04-21 16:13:43 +02:00
Claudio Atzori cd320efa96 added extra spark options to graph to hive workflow 2020-04-21 16:12:20 +02:00
Miriam Baglioni 90c768dde6 added shaded libs module 2020-04-21 16:03:51 +02:00
Claudio Atzori 91e72a6944 Dataset based implementation for SparkCreateDedupRecord phase, fixed datasource entity dump supplementing dedup unit tests 2020-04-21 12:06:08 +02:00
miconis 5c9ef08a8e spark dedup test fixed 2020-04-21 10:19:04 +02:00
Claudio Atzori d772d967aa restored changes from master branch 2020-04-20 18:53:06 +02:00
Claudio Atzori eb8a020859 fixed behaviour of DedupRecordFactory 2020-04-20 18:44:06 +02:00
Claudio Atzori ede1af3d85 Merge branch 'master' into deduptesting 2020-04-20 16:52:14 +02:00
miconis 1102e32462 SparkDedupTest updated and organization dump fixed 2020-04-20 16:49:01 +02:00
Claudio Atzori 667d23c58b finalising Actionset migration workflow 2020-04-20 16:45:21 +02:00
miconis 4da13e4570 Revert "Merge branch 'master' into deduptesting". This reverts commit 772f75d167, reversing changes made to 5f45f2c77f. 2020-04-20 16:04:49 +02:00
Claudio Atzori 9147af7fed actionsets migration workflow moved in dhp-workflows/dhp-actionmanager 2020-04-20 15:24:33 +02:00
miconis 772f75d167 Merge branch 'master' into deduptesting 2020-04-20 14:50:12 +02:00
Claudio Atzori d714bfb4d4 collectedfrom field moved in common parent class Oaf.java 2020-04-20 12:25:19 +02:00
Michele Artini 8ff7facfa3 fixed collectedFrom ID 2020-04-20 11:09:27 +02:00
Michele Artini 25307965d2 add a default datainfo if missing 2020-04-20 09:43:27 +02:00
Michele Artini d2058fdc47 tests 2020-04-20 09:31:14 +02:00
Michele Artini 478a958f09 tests 2020-04-20 09:15:27 +02:00
Miriam Baglioni e1848b7603 minor 2020-04-18 14:16:42 +02:00
Miriam Baglioni 0ff9b1ef05 added needed parameter 2020-04-18 14:16:29 +02:00
Miriam Baglioni e2dfe8b656 removed not used action 2020-04-18 14:16:07 +02:00
Miriam Baglioni 437ebbad76 refactoring 2020-04-18 14:15:09 +02:00
Miriam Baglioni 9a8876ac86 added needed parameter 2020-04-18 14:14:08 +02:00
Miriam Baglioni 9854852878 refactoring 2020-04-18 14:13:16 +02:00
Miriam Baglioni 454b8a6a29 Merge remote-tracking branch 'upstream/master' 2020-04-18 14:09:44 +02:00
Miriam Baglioni 890ec28f0f input parameters for preparation step1 2020-04-18 14:09:37 +02:00
Miriam Baglioni fbf5c27c27 Added preparation classes before actual propagation 2020-04-18 14:09:03 +02:00
Claudio Atzori 5f45f2c77f Merge branch 'master' into deduptesting 2020-04-18 12:46:40 +02:00
Claudio Atzori ad7a131b18 introduced common project code formatting plugin, works on the commit hook, based on https://github.com/Cosium/git-code-format-maven-plugin, applied to each java class in the project 2020-04-18 12:42:58 +02:00
Claudio Atzori a2938dd059 cleanup 2020-04-18 12:24:22 +02:00
Claudio Atzori 9374ff03ea Merge branch 'master' into deduptesting 2020-04-18 12:06:58 +02:00
Claudio Atzori 71813795f6 various refactorings on the dnet-dedup-openaire workflow 2020-04-18 12:06:23 +02:00
miconis 6450bb0daa test for softwares dedup added. definition of orp, dataset and sw dedup configurations 2020-04-17 17:31:59 +02:00
Miriam Baglioni 72c63a326e removed unuseful class 2020-04-17 17:14:51 +02:00
Miriam Baglioni 00c2ca3ee5 - 2020-04-17 17:14:25 +02:00
Miriam Baglioni 5cd092114f use mergeFrom method to add the new community contexts 2020-04-17 17:13:18 +02:00
Miriam Baglioni 264c82f21e minor 2020-04-17 16:54:46 +02:00
Miriam Baglioni 8c079c7a49 unit test for orcid to result propagation from semrel 2020-04-17 16:53:03 +02:00
Miriam Baglioni eacd140a98 added missing parameter(s) 2020-04-17 16:52:30 +02:00
Miriam Baglioni 390e250faf use the addPid method of the Author class to add a new pid 2020-04-17 16:52:02 +02:00
Miriam Baglioni b46b080ddc use mergeFrom method call to add the country(ies) instead of modify the result directly. 2020-04-17 16:50:54 +02:00
Miriam Baglioni c4987dd12a minor 2020-04-17 16:49:08 +02:00
Claudio Atzori 038ac7afd7 relation consistency workflow separated from dedup scan and creation of CCs 2020-04-17 13:12:44 +02:00
Claudio Atzori c92bfeeaee Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-04-17 13:07:52 +02:00
Miriam Baglioni adc11c97a7 Merge remote-tracking branch 'upstream/master' 2020-04-17 12:34:31 +02:00
Sandro La Bruzzo 01ea7721f3 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-04-17 12:12:25 +02:00
Sandro La Bruzzo 5e2fa996aa fixed problem with conversion of long into string 2020-04-17 12:11:51 +02:00
miconis 418cf94642 implementation of the deletedbyinference test in propagating relations 2020-04-17 10:40:21 +02:00
Miriam Baglioni 5d772e5263 new implementation of propagation of community to result from organization that exploits the prepared info 2020-04-16 18:45:22 +02:00
Miriam Baglioni fff1e5ec39 classes to (de)serialize the data provided in the preparation step 2020-04-16 18:44:43 +02:00
Miriam Baglioni 3fd9d6b02f preparation phase for the propagation of community to result from organization 2020-04-16 18:43:55 +02:00
Miriam Baglioni a9120164aa added hive parameter and a step of reset of the working dir in the workflow 2020-04-16 18:42:04 +02:00
Miriam Baglioni 6afbd542ca changed the save mode to avoid NegativeArraySize... error. Needed to modify also the preparationstep2 2020-04-16 18:40:14 +02:00
Miriam Baglioni d60fd36046 changed the save method 2020-04-16 16:14:15 +02:00
Miriam Baglioni 951b13ac46 input parameters and workflow for new implementation of propagation of orcid to result from semrel and preparation phases 2020-04-16 16:13:10 +02:00
Miriam Baglioni 4d89f3dfed removed unuseful classes 2020-04-16 16:11:44 +02:00
Miriam Baglioni 5e72a51f11 - 2020-04-16 16:11:20 +02:00
Miriam Baglioni c33a593381 renamed 2020-04-16 16:09:47 +02:00
Miriam Baglioni 0e5399bf74 second phase of data preparation. Groups all the possible updates by id 2020-04-16 16:08:51 +02:00
Miriam Baglioni 548ba915ac first phase of data preparation. For each result type (parallel) it produces the possible updates 2020-04-16 15:58:42 +02:00
Miriam Baglioni 243013cea3 to (de)serialize the association from the resultId and the list of authoritative authors with orcid to possibly propagate 2020-04-16 15:57:29 +02:00
Miriam Baglioni ac3ad25b36 to (de)serialize needed information of the author to determine if the orcid can be passed (name, surname, fullname (?), orcid) 2020-04-16 15:56:33 +02:00
Miriam Baglioni d6cd700a32 new implementation that exploits prepared information (the list of possible updates: resultId - possible list of orcid to be added) 2020-04-16 15:55:25 +02:00
Miriam Baglioni f077f22f73 minor 2020-04-16 15:54:16 +02:00
Miriam Baglioni fd5d792e35 refactoring 2020-04-16 15:53:34 +02:00
Claudio Atzori cb0952428e Merge branch 'master' into deduptesting 2020-04-16 14:42:25 +02:00
Claudio Atzori cc21bbfb1a Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting 2020-04-16 14:41:37 +02:00
Claudio Atzori ec5dfc068d added spark.sql.shuffle.partitions=3840 to dedup scan wf 2020-04-16 14:41:28 +02:00
Claudio Atzori 09f356b047 Merge pull request 'Closes #7: subdirs inside graph table dirs' (#8) from przemyslaw.jacewicz/dnet-hadoop:przemyslawjacewicz_7_distcp_configuration_fix into master. Run the code from this PR in isolation and it worked fine. Thanks! 2020-04-16 14:38:46 +02:00
Claudio Atzori 3437383112 Merge branch 'master' into deduptesting 2020-04-16 12:46:14 +02:00
miconis 0eccbc318b Deduper class (utilities for dedup) cleaned. Useless methods removed 2020-04-16 12:36:37 +02:00
Claudio Atzori 76d23895e6 Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting 2020-04-16 12:18:32 +02:00
miconis 6a089ec287 minor changes 2020-04-16 12:15:38 +02:00
Claudio Atzori 376efd67de removed prepare statement in spark action 2020-04-16 12:14:16 +02:00
miconis 9b36458b6a Merge branch 'deduptesting' of code-repo.d4science.org:D-Net/dnet-hadoop into deduptesting 2020-04-16 12:13:58 +02:00
miconis cd4d9a148f creating temporary directories in dedup test 2020-04-16 12:13:26 +02:00
Claudio Atzori b39ff36c16 improving the wf definitions 2020-04-16 12:11:37 +02:00
Claudio Atzori 011b342bc9 trying to avoid OOM in SparkPropagateRelation 2020-04-16 11:13:51 +02:00
Miriam Baglioni 08227cfcbd resources needed for running the test on propagation of result to organization from institutional repositories 2020-04-16 11:06:10 +02:00
Miriam Baglioni a97e915c24 test unit for propagation of result to organization from institutional repository 2020-04-16 11:05:21 +02:00
Miriam Baglioni b078710924 modification to the test due to the removal of unused parameters 2020-04-16 11:04:39 +02:00
Miriam Baglioni a5e5c81a2c input parameters and workflow definition for propagation of result to organization from institutional repositories 2020-04-16 11:03:41 +02:00
Miriam Baglioni 5e1bd67680 removed unuseful parameter 2020-04-16 11:02:01 +02:00
Miriam Baglioni eaf19ce01b removed unuseful class 2020-04-16 10:59:33 +02:00
Miriam Baglioni 7bd49abbef commit to delete 2020-04-16 10:59:09 +02:00
Miriam Baglioni 53f418098b added the isTest checkpoint 2020-04-16 10:53:48 +02:00
Miriam Baglioni c28333d43f minor 2020-04-16 10:52:50 +02:00
Miriam Baglioni a8100baed6 changed the way to save the results to avoid NegativeArray... error 2020-04-16 10:50:09 +02:00
Miriam Baglioni 79b978ec57 refactoring 2020-04-16 10:48:41 +02:00
Claudio Atzori 069ef5eaed trying to avoid OOM in SparkPropagateRelation 2020-04-15 21:23:21 +02:00
Claudio Atzori 8eedfefc98 try to introduce intermediate serialization on hdfs to avoid OOM 2020-04-15 18:35:35 +02:00
Przemysław Jacewicz da019495d7 [dhp-actionmanager] target dir removal added for distcp actions 2020-04-15 17:56:57 +02:00
miconis 5689d49689 minor changes 2020-04-15 16:34:06 +02:00
Claudio Atzori c439d0c6bb PromoteActionPayloadForGraphTableJob reads directly the content pointed by the input path, adjusted promote action tests (ISLookup mock) 2020-04-15 16:18:33 +02:00
Claudio Atzori ff30f99c65 using newline delimited json files for the raw graph materialization. Introduced contentPath parameter 2020-04-15 16:16:20 +02:00
Sandro La Bruzzo 3d3ac76dda Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-04-15 15:24:01 +02:00
Sandro La Bruzzo 74a7fac774 fixed problem with timestamp 2020-04-15 15:23:54 +02:00
Miriam Baglioni 3577219127 removed unuseful classes 2020-04-15 12:45:49 +02:00
Miriam Baglioni 964b22d418 modified the writing of the new relations. Before: read old rels, add the new ones to them, write all the relations in the new location. Now: the first step of the wf copies the old relations to the new location. If new relations are found, they are saved in the new location in append mode. 2020-04-15 12:32:01 +02:00