Commit Graph

432 Commits

Author SHA1 Message Date
miconis 418cf94642 implementation of the deletedbyinference test in propagating relations 2020-04-17 10:40:21 +02:00
Claudio Atzori cb0952428e Merge branch 'master' into deduptesting 2020-04-16 14:42:25 +02:00
Claudio Atzori cc21bbfb1a Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting 2020-04-16 14:41:37 +02:00
Claudio Atzori ec5dfc068d added spark.sql.shuffle.partitions=3840 to dedup scan wf 2020-04-16 14:41:28 +02:00
Claudio Atzori 09f356b047 Merge pull request 'Closes #7: subdirs inside graph table dirs' (#8) from przemyslaw.jacewicz/dnet-hadoop:przemyslawjacewicz_7_distcp_configuration_fix into master
Ran the code from this PR in isolation and it worked fine. Thanks!
2020-04-16 14:38:46 +02:00
Claudio Atzori 3437383112 Merge branch 'master' into deduptesting 2020-04-16 12:46:14 +02:00
miconis 0eccbc318b Deduper class (utilities for dedup) cleaned. Useless methods removed 2020-04-16 12:36:37 +02:00
Claudio Atzori 76d23895e6 Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting 2020-04-16 12:18:32 +02:00
miconis 6a089ec287 minor changes 2020-04-16 12:15:38 +02:00
Claudio Atzori 376efd67de removed prepare statement in spark action 2020-04-16 12:14:16 +02:00
miconis 9b36458b6a Merge branch 'deduptesting' of code-repo.d4science.org:D-Net/dnet-hadoop into deduptesting 2020-04-16 12:13:58 +02:00
miconis cd4d9a148f creating temporary directories in dedup test 2020-04-16 12:13:26 +02:00
Claudio Atzori b39ff36c16 improving the wf definitions 2020-04-16 12:11:37 +02:00
Claudio Atzori 011b342bc9 trying to avoid OOM in SparkPropagateRelation 2020-04-16 11:13:51 +02:00
Claudio Atzori 069ef5eaed trying to avoid OOM in SparkPropagateRelation 2020-04-15 21:23:21 +02:00
Claudio Atzori 8eedfefc98 try to introduce intermediate serialization on hdfs to avoid OOM 2020-04-15 18:35:35 +02:00
Przemysław Jacewicz da019495d7 [dhp-actionmanager] target dir removal added for distcp actions 2020-04-15 17:56:57 +02:00
miconis 5689d49689 minor changes 2020-04-15 16:34:06 +02:00
Claudio Atzori c439d0c6bb PromoteActionPayloadForGraphTableJob reads directly the content pointed by the input path, adjusted promote action tests (ISLookup mock) 2020-04-15 16:18:33 +02:00
Claudio Atzori ff30f99c65 using newline delimited json files for the raw graph materialization. Introduced contentPath parameter 2020-04-15 16:16:20 +02:00
Sandro La Bruzzo 3d3ac76dda Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-04-15 15:24:01 +02:00
Sandro La Bruzzo 74a7fac774 fixed problem with timestamp 2020-04-15 15:23:54 +02:00
Alessia Bardi 550a9f82ed Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-04-14 17:53:01 +02:00
Alessia Bardi a68fae9bcb now supporting openaire 4.0 compliance 2020-04-14 17:52:48 +02:00
Sandro La Bruzzo c36239e693 fixed incremental indexing 2020-04-14 17:47:36 +02:00
Claudio Atzori 82e8341f50 reorganizing parameter names in the provision workflow 2020-04-14 15:54:41 +02:00
Claudio Atzori 6b5f9ca9cb raw graph creation workflow moved under dhp-graph-mapper, claims integration is included 2020-04-10 17:53:07 +02:00
miconis 0be2e72be5 further implementation of tests for the deduplication of each entity. publication dump added, empty entity files created 2020-04-08 18:02:30 +02:00
Claudio Atzori 47f3d9b757 unit test for GraphHiveImporterJob 2020-04-08 13:24:43 +02:00
Sandro La Bruzzo ba9f07a6fe fixed wrong test 2020-04-08 13:18:20 +02:00
Claudio Atzori d74e128aa6 Utility classes moved in dhp-common and dhp-schemas 2020-04-07 11:56:22 +02:00
Claudio Atzori c57cf679ca Merge branch 'provision_dataset' 2020-04-07 08:56:58 +02:00
Claudio Atzori 1a1a026a18 we do expect to find field bestaccessright already defined. No need to add it again 2020-04-07 08:55:33 +02:00
Claudio Atzori fbdd18a96b using dataset based relation preparation procedure 2020-04-07 08:54:39 +02:00
Claudio Atzori 77f59b1b10 dataset based provision WIP 2020-04-06 19:37:27 +02:00
Claudio Atzori 6177cf36fb Merge pull request 'Closes #4: New action manager implementation' (#5) from przemyslaw.jacewicz/dnet-hadoop:przemyslawjacewicz_actionmanager_impl_prototype into master
Nothing more to add here. Thanks for your contribution!
2020-04-06 17:35:07 +02:00
Claudio Atzori e355961997 dataset based provision WIP 2020-04-06 17:34:25 +02:00
miconis 56fbe689f0 implementation of the tests for each spark action 2020-04-06 16:30:31 +02:00
Claudio Atzori ca345aaad3 dataset based provision WIP 2020-04-06 15:33:31 +02:00
Claudio Atzori c8f4b95464 dataset based provision WIP 2020-04-06 08:59:58 +02:00
Claudio Atzori eb2f5f3198 dataset based provision WIP 2020-04-04 17:41:31 +02:00
Claudio Atzori 3d1b637cab dataset based provision WIP 2020-04-04 14:03:43 +02:00
miconis 53fd624c34 implemented test for sparkcreatesimrels 2020-04-03 18:32:25 +02:00
Claudio Atzori 24b2c9012e dataset based provision WIP 2020-04-02 18:44:09 +02:00
miconis a61763d149 structure for sparksimrel changed to be compliant with mockito testing 2020-04-02 18:37:53 +02:00
Claudio Atzori daa26acc9d dataset based provision WIP, fixed spark2EventLogDir 2020-04-02 16:15:50 +02:00
Przemysław Jacewicz 7b2a7e2417 [dhp-actionmanager] missing descriptions added and minor naming and formatting fixes 2020-04-02 11:48:40 +02:00
Spyros Zoupanos 1ab97bbe00 Adding the full stats workflow to the dnet-hadoop hierarchy 2020-04-01 22:22:05 +03:00
Claudio Atzori 9c7092416a dataset based provision WIP 2020-04-01 19:07:30 +02:00
miconis bfa5bc74df minor changes 2020-04-01 19:05:48 +02:00
Przemysław Jacewicz 80cf43b9c8 [dhp-actionmanager] promoting workflow added 2020-04-01 18:51:25 +02:00
Przemysław Jacewicz 5b459bcc47 [dhp-actionmanager] promoting spark job added 2020-04-01 18:49:08 +02:00
miconis 9802bcb9fe dedup testing 2020-04-01 18:48:31 +02:00
Przemysław Jacewicz e21bb89dbd [dhp-actionmanager] partitioning spark job added 2020-04-01 18:41:29 +02:00
Przemysław Jacewicz f9f7350bb9 [dhp-actionmanager] common package added with utility classes supporting hadoop and spark envs 2020-04-01 18:39:26 +02:00
Przemysław Jacewicz ad70c23b2e [dhp-actionmanager] pom updated 2020-04-01 18:36:00 +02:00
Przemysław Jacewicz 4e910a78d4 [dhp-workflows] spark 2 connection properties added 2020-04-01 18:29:26 +02:00
Claudio Atzori 1402eb1fe7 cleanup 2020-04-01 15:38:50 +02:00
Claudio Atzori 7061d07727 ActionSets migration serialize the output as plain text files instead of SequenceFiles 2020-04-01 14:58:22 +02:00
Claudio Atzori adcdd2d05e WIP: reimplementing the adjacency list construction process using spark Datasets 2020-04-01 14:56:57 +02:00
Sandro La Bruzzo 205e9521c6 implemented import crossref job 2020-04-01 14:12:33 +02:00
Sandro La Bruzzo 201d79021e Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-31 14:41:41 +02:00
Sandro La Bruzzo cd7416ae4c first implementation of incremental update of scholix index 2020-03-31 14:41:35 +02:00
przemek 9d1d18d4b9 Merge branch 'master' into przemyslawjacewicz_actionmanager_impl_prototype 2020-03-31 12:04:58 +02:00
Claudio Atzori 377e1ba840 [maven-release-plugin] prepare for next development iteration 2020-03-30 20:06:00 +02:00
Claudio Atzori 76d9315129 [maven-release-plugin] prepare release dhp-1.1.6 2020-03-30 20:05:56 +02:00
Claudio Atzori ef429010ee removed log file and job-override.properties 2020-03-30 20:00:58 +02:00
Claudio Atzori 0fbec69b82 use oozie prepare statement to cleanup working directories 2020-03-30 19:48:41 +02:00
Claudio Atzori 3af2b8d700 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-30 13:12:21 +02:00
Claudio Atzori f3f9affd49 allow dynamic executors to build XML records 2020-03-30 13:12:11 +02:00
Claudio Atzori 2e2d4c4c68 adjusted path to template resource 2020-03-30 13:11:49 +02:00
Sandro La Bruzzo 62cc257e5c fixed step1 workflow 2020-03-27 17:07:34 +01:00
Sandro La Bruzzo 1a7a866861 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-27 15:11:48 +01:00
Sandro La Bruzzo 7cef698f36 reformat code 2020-03-27 15:11:34 +01:00
Claudio Atzori 1767dfaa3f method can be protected, it is meant to be used only in tests 2020-03-27 14:31:26 +01:00
Sandro La Bruzzo a4b6a51168 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-27 13:48:56 +01:00
Sandro La Bruzzo 15d9106b3f Fixed merge of dhp dedup 2020-03-27 13:48:44 +01:00
Claudio Atzori e196fff212 adjusted path for source resource in unit test 2020-03-27 13:45:10 +01:00
Sandro La Bruzzo 8c9a56a0c8 refactored package name 2020-03-27 13:19:33 +01:00
Sandro La Bruzzo 2bd2d6f202 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-27 13:16:36 +01:00
Sandro La Bruzzo a9935f80d4 refactor class name and workflow name for graph mapper, added javadoc 2020-03-27 13:16:24 +01:00
Michele Artini ae03948eed Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-27 11:47:07 +01:00
Michele Artini f6e86b44a6 tests 2020-03-27 11:46:37 +01:00
Michele Artini 408be3c632 test and fixed a problem with datacite namespaces 2020-03-27 11:44:50 +01:00
Claudio Atzori 673e744649 moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa 2020-03-27 10:42:17 +01:00
Claudio Atzori 098fabab3f reorganizing content under dhp-workflows/dhp-graph-mapper 2020-03-26 19:44:19 +01:00
Claudio Atzori 77c4294924 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-26 18:26:52 +01:00
Claudio Atzori 43cbcda7ef unit test for SparkGraphImporterJob 2020-03-26 18:26:40 +01:00
Sandro La Bruzzo e04da6d66a merged all oozie wf in one 2020-03-26 14:17:07 +01:00
Sandro La Bruzzo e71e001b58 commented test that doesn't work 2020-03-26 14:15:21 +01:00
Sandro La Bruzzo 0cd022ad6a merge with master 2020-03-26 14:08:29 +01:00
Claudio Atzori abcd3f5bf5 added sample data for unit tests 2020-03-26 11:12:52 +01:00
Sandro La Bruzzo d5f11e27be renamed wf 2020-03-26 09:49:23 +01:00
Sandro La Bruzzo 9a37ad0127 renamed modules 2020-03-26 09:46:46 +01:00
Sandro La Bruzzo a768226e52 updated generate scholix to generate json 2020-03-26 09:40:50 +01:00
Claudio Atzori 9dff4adbc3 dhp-graph-mapper workflow tests upgraded to junit5 2020-03-25 18:25:12 +01:00
Claudio Atzori cd7dc3e1ae dhp-dedup-openaire workflow tests upgraded to junit5 2020-03-25 18:04:23 +01:00
Claudio Atzori c0e825e713 dhp-aggregation workflow tests upgraded to junit5 2020-03-25 17:59:45 +01:00
Michele Artini ebe45003d9 fixed some junit packages 2020-03-25 16:45:03 +01:00
Michele Artini d9bfdcd607 updated poms 2020-03-25 16:31:12 +01:00
Michele Artini 120e823cd1 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-25 16:00:10 +01:00
Claudio Atzori 71ae7dd272 renamed module dnet-dedup to dnet-dedup-openaire 2020-03-25 15:57:09 +01:00
Michele Artini fd57722c69 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-25 15:56:49 +01:00
Claudio Atzori f441f823dd fixed path referencing a test resource file 2020-03-25 15:21:46 +01:00
Claudio Atzori 51d0c9bdd7 integrated changes from branch dedupTest 2020-03-25 15:15:41 +01:00
Claudio Atzori 36f8f2ea66 master set to 'yarn' in spark actions, removed path to rawSet from the dedup scan workflow 2020-03-25 14:16:06 +01:00
Michele Artini 2559299da4 tests 2020-03-25 12:25:00 +01:00
Claudio Atzori 2180cc4fe7 more fields included in result view definition 2020-03-25 11:21:46 +01:00
Claudio Atzori efb0b7d660 master set to 'yarn' in spark actions 2020-03-25 11:15:35 +01:00
Michele Artini 0fda2c3a30 some tests on db records 2020-03-25 09:43:58 +01:00
miconis 02320de371 minor changes 2020-03-24 17:43:51 +01:00
miconis 8e8b5e8f30 roots wf merged in scan wf 2020-03-24 17:40:58 +01:00
Claudio Atzori 51ff68db66 Merge branch 'dedupTest' of https://code-repo.d4science.org/D-Net/dnet-hadoop into dedupTest 2020-03-24 11:18:19 +01:00
Claudio Atzori 1e869e7bed using method available from currently used library 2020-03-24 11:17:44 +01:00
miconis f0d72b76a8 package structure fixed 2020-03-24 10:51:40 +01:00
Claudio Atzori aaedbb1b8b WIP: dedup workflow, stage 2 2020-03-24 09:59:28 +01:00
Michele Artini e3760c7f39 fix a bug with organization countries 2020-03-24 08:43:56 +01:00
Claudio Atzori 8b0ba3d76a postprocessing script correctly run as hive2 action 2020-03-23 17:40:39 +01:00
miconis 93e2291291 minor changes 2020-03-23 17:17:56 +01:00
miconis f7890a90df implementation of the mechanism that checks the existence of a mergerel file 2020-03-23 17:13:30 +01:00
miconis c20e179f5a structure of the workflows updated 2020-03-23 11:43:49 +01:00
Claudio Atzori 658d40ccbe WIP trying to use hive2 actions 2020-03-23 11:14:54 +01:00
Claudio Atzori ecb64e4998 Merge branch 'migration_wfs_regular_all_steps' 2020-03-23 08:57:01 +01:00
Michele Artini 15160032bd fixed a bug setting some organization fields 2020-03-23 08:39:14 +01:00
Claudio Atzori a4c52661a0 WIP: fixing dedup workflows 2020-03-20 19:17:24 +01:00
Claudio Atzori 6cb0a9bff0 dedup wf directory structure aligned with project commons 2020-03-20 16:48:14 +01:00
miconis e16e644faf implementation of the workflow for entity update and for relations update 2020-03-20 13:01:56 +01:00
przemek 638b78f96a Merge remote-tracking branch 'origin/master' into przemyslawjacewicz_actionmanager_impl_prototype 2020-03-19 15:12:56 +01:00
miconis 4e82a24af2 minor changes and implementation of the create connected components action 2020-03-19 15:01:07 +01:00
Claudio Atzori 36236dd1c1 action migration workflow produces eu.dnetlib.dhp.schema.action.AtomicAction(s) 2020-03-19 14:00:38 +01:00
Claudio Atzori a0ab15a64c need to stick on using guava:11.0.2 as it is the version used by the hadoop components (oozie client for sure). The last version (28.2-jre) breaks the oozie workflow submission 2020-03-19 13:58:58 +01:00
Sandro La Bruzzo 0594b92a6d implemented relation with dataset 2020-03-19 11:11:07 +01:00
miconis 679b5869e5 implementation of the lookup procedure to take dedup conf from the resource profiles 2020-03-18 17:41:56 +01:00
Claudio Atzori abe8fb69a2 added global properties, moved postprocessing script inside the oozie_app directory 2020-03-18 15:43:54 +01:00
miconis f32eae5ce9 implementation of the spark action for the simrel creation 2020-03-18 14:27:49 +01:00
Claudio Atzori c7e0730720 compress the output produced by migration steps 1 and 2 2020-03-18 09:34:57 +01:00
Claudio Atzori 2f11e37602 fixed expansion of path variables 2020-03-17 19:41:07 +01:00
Claudio Atzori 2795b0b096 no need to mkdir the all_entities file 2020-03-17 17:22:14 +01:00
Claudio Atzori 19746ad308 when reuseContent, reset ${workingPath}/all_entities 2020-03-17 17:17:06 +01:00
Claudio Atzori 2f0c85eeb3 updated parameters for regular_all_steps workflow, introduced flag 'reuseContent' 2020-03-17 17:04:58 +01:00
Claudio Atzori b8290b5851 updated parameters for regular_all_steps workflow 2020-03-17 15:45:30 +01:00
Claudio Atzori 4706f24ec5 updated parameters for regular_all_steps workflow 2020-03-17 15:23:54 +01:00
Claudio Atzori aeb01fa353 reading from newline delimited json textfiles instead of sequence files 2020-03-17 11:57:24 +01:00
Claudio Atzori af835f2f98 when migrating actionsets from DM cluster, populate the AtomicAction.targetValue when empty (dedup similarities) 2020-03-15 18:07:59 +01:00
Claudio Atzori 9c84e21b87 added workflow to migrate latest version of each actionset content from DM to OCEAN cluster, mapping the targetValues from the old protobuf data model to the dhp.OAF datamodel 2020-03-13 15:56:52 +01:00
Claudio Atzori 8fe7ae1482 xml formatting 2020-03-13 15:53:56 +01:00
Przemysław Jacewicz d0c9b0cdd6 WIP promote job functions updated 2020-03-13 12:36:42 +01:00
Przemysław Jacewicz 8d9b3c5de2 WIP action payload mapping into OAF type moved, (local) graph table name enum created, tests fixed 2020-03-13 10:01:39 +01:00
Przemysław Jacewicz 5cc560c7e5 Removed unnecessary dependency on old OAF model 2020-03-13 09:57:46 +01:00
Sandro La Bruzzo addaaa091f migrate relation from RDD to Dataset 2020-03-13 09:13:20 +01:00