390 Commits

Author SHA1 Message Date
Claudio Atzori aeb01fa353 reading from newline delimited json textfiles instead of sequence files 2020-03-17 11:57:24 +01:00
Claudio Atzori af835f2f98 when migrating actionsets from DM cluster, populate the AtomicAction.targetValue when empty (dedup similarities) 2020-03-15 18:07:59 +01:00
Claudio Atzori 9c84e21b87 added workflow to migrate latest version of each actionset content from DM to OCEAN cluster, mapping the targetValues from the old protobuf data model to the dhp.OAF datamodel 2020-03-13 15:56:52 +01:00
Claudio Atzori 8fe7ae1482 xml formatting 2020-03-13 15:53:56 +01:00
Przemysław Jacewicz d0c9b0cdd6 WIP promote job functions updated 2020-03-13 12:36:42 +01:00
Przemysław Jacewicz 8d9b3c5de2 WIP action payload mapping into OAF type moved, (local) graph table name enum created, tests fixed 2020-03-13 10:01:39 +01:00
Przemysław Jacewicz 5cc560c7e5 Removed unnecessary dependency on old OAF model 2020-03-13 09:57:46 +01:00
Sandro La Bruzzo addaaa091f migrate relation from RDD to Dataset 2020-03-13 09:13:20 +01:00
Przemysław Jacewicz 3f24593e51 WIP: promote job tests and test resources implementation snapshot 2020-03-11 17:06:29 +01:00
Przemysław Jacewicz 2e996d610f WIP: promote job functions implementation snapshot 2020-03-11 17:02:57 +01:00
Przemysław Jacewicz cc63cdc9e6 WIP: promote job implementation snapshot 2020-03-11 17:02:06 +01:00
Przemysław Jacewicz 69540f6f78 Serialization-safe supplier added 2020-03-11 16:59:05 +01:00
Przemysław Jacewicz e6e214dab5 Oaf merge and get strategy added 2020-03-11 16:58:17 +01:00
Claudio Atzori 7b6f0c8756 reading graph dump as text files, encoded as newline-delimited JSON records, as indicated in the wiki 2020-03-10 17:19:17 +01:00
Claudio Atzori 60aedb1110 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-10 17:09:44 +01:00
Claudio Atzori a3f184fd3f added field websiteurl in related organizations 2020-03-10 17:08:58 +01:00
Claudio Atzori 0e95544495 fixed serialization for datasource subjects 2020-03-10 17:07:44 +01:00
Sandro La Bruzzo 7b28783fb4 updated unpaywall mapping 2020-03-08 17:00:19 +01:00
Michele Artini b6efa9d6ab Configuration of the SequenceFile Writer 2020-03-05 15:49:14 +01:00
Claudio Atzori 5e342a555c no need to compute the inverse relClass, fixed text() in xpath expressions 2020-03-05 12:51:48 +01:00
Claudio Atzori 6ec04d4e02 specified column used to perform the join operation in the javadoc 2020-03-05 12:50:38 +01:00
Michele Artini 7a2a466161 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-04 14:50:59 +01:00
Michele Artini 755eade2fb fix creation ids 2020-03-04 14:49:45 +01:00
Claudio Atzori 6379f32466 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-04 10:57:06 +01:00
Claudio Atzori 0233987603 introduced post processing step following the hive DB creation/population 2020-03-04 10:56:50 +01:00
Claudio Atzori 1e563bc15e introduced distinct properties driving the resource usage for the XML record creation and the indexing phase 2020-03-04 10:55:11 +01:00
Claudio Atzori 9af3e904be close the SparkSession at the end 2020-03-04 10:53:31 +01:00
Michele Artini e7167b996a logs and closeable 2020-03-04 10:46:36 +01:00
Claudio Atzori 25ceec29ab code formatting 2020-03-04 10:44:24 +01:00
Claudio Atzori 63c00c5e88 fixed typo 2020-03-04 10:43:44 +01:00
Claudio Atzori 9cf5ce2e66 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-02 17:03:10 +01:00
Claudio Atzori bc7cfd5975 indexing workflow WIP: fixed projects fundingtree xml conversion, prioritized links between results and projects when limiting them to 100 in the join procedure 2020-03-02 17:03:07 +01:00
Michele Artini 4b29a121b0 migration using spark in step2 2020-03-02 16:12:14 +01:00
Michele Artini 5445a57102 migration using spark in step2 2020-03-02 16:11:59 +01:00
Sandro La Bruzzo b32655e48e changed code to save intermediate result 2020-02-27 10:18:46 +01:00
Claudio Atzori 60bc2b1a20 drop the hive DB before populating it from scratch 2020-02-27 10:10:55 +01:00
Sandro La Bruzzo f09e065865 incremented number of repartition 2020-02-26 19:26:19 +01:00
Sandro La Bruzzo 071f5c3e52 fixed NPE 2020-02-26 15:42:20 +01:00
Sandro La Bruzzo a1a6fc8315 fixed NPE 2020-02-26 15:42:13 +01:00
Sandro La Bruzzo 1edf02a3ce added log 2020-02-26 15:25:03 +01:00
Sandro La Bruzzo c3ecabd8e8 fixed NPE 2020-02-26 14:40:02 +01:00
Sandro La Bruzzo 5d0f46651b fixed NPE 2020-02-26 14:31:34 +01:00
Sandro La Bruzzo bc342bf73a fixed wrong generation type in summary 2020-02-26 12:49:47 +01:00
Sandro La Bruzzo 3112e21858 fixed typo 2020-02-26 12:22:43 +01:00
Sandro La Bruzzo 119ae6eef5 fixed wrong loop in the workflow 2020-02-26 12:18:50 +01:00
Sandro La Bruzzo 7936583a3d added generation of Scholix collection 2020-02-26 12:09:06 +01:00
Przemysław Jacewicz 02db368dc5 Merge branch 'master' into przemyslawjacewicz_actionmanager_impl_prototype 2020-02-26 11:50:20 +01:00
Sandro La Bruzzo 2ef3705b2c Added Provision workflow 2020-02-26 10:51:35 +01:00
Michele Artini 689908b2e9 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-02-25 16:00:51 +01:00
Michele Artini 93665773ea Fixed a problem with JavaRDD Union 2020-02-25 15:59:21 +01:00
Sandro La Bruzzo b021b8a2e1 Added index wf 2020-02-24 10:15:55 +01:00
Claudio Atzori 6a73fd5da5 in order to reuse the same XmlRecordFactory across different tasks, the state of contexts must be one per record built 2020-02-21 09:17:19 +01:00
Michele Artini d49cd2fdc6 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-02-20 11:21:54 +01:00
Claudio Atzori 5e5e32cb48 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-02-19 16:56:52 +01:00
Claudio Atzori 33185fd0b7 ISLookupClientFactory moved in dhp-common 2020-02-19 16:56:38 +01:00
Michele Artini 5d3739b5cf migration of claims 2020-02-19 15:11:17 +01:00
Michele Artini 173f1df1e5 saved a query for openaire production database 2020-02-19 10:15:08 +01:00
Sandro La Bruzzo 9a2d74ac82 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-02-19 10:13:45 +01:00
Sandro La Bruzzo e5d7cdf422 fixed sql query 2020-02-19 10:13:36 +01:00
Sandro La Bruzzo 2b8675462f refactoring code 2020-02-19 10:07:08 +01:00
Claudio Atzori ed76521d9b removed stale test resources, will be re-added later on 2020-02-18 11:51:08 +01:00
Claudio Atzori 0f364605ff removed stale tests, need to reimplement them anyway 2020-02-18 11:48:19 +01:00
Przemysław Jacewicz 958f0693d6 WIP: logic for promoting action sets added 2020-02-17 18:19:19 +01:00
Przemysław Jacewicz bea1a94346 Merge branch 'master' into przemyslawjacewicz_actionmanager_impl_prototype
# Conflicts:
#	dhp-workflows/pom.xml
2020-02-17 15:07:23 +01:00
Claudio Atzori 6a288625e5 fixed workflow outgoing node 2020-02-17 15:04:33 +01:00
Claudio Atzori 1b18fd4d54 sync with master branch 2020-02-17 13:49:46 +01:00
Sandro La Bruzzo 4f04759738 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-02-17 12:31:58 +01:00
Sandro La Bruzzo 76ee85141a added oozie job for DNET migration and implemented Spark job for extracting entities 2020-02-17 12:31:44 +01:00
Claudio Atzori c460e2d281 Update 'dhp-workflows/docs/oozie-installer.markdown' 2020-02-17 11:54:48 +01:00
Michele Artini 176c5606bd aligned with origin/master, aligned model and mapping 2020-02-17 10:40:53 +01:00
Claudio Atzori 56d1810a66 working procedure for records indexing using Spark, via lib com.lucidworks.spark:spark-solr 2020-02-14 12:28:52 +01:00
Claudio Atzori 1ee1baa8c0 Merge branch 'master' into provision_indexing 2020-02-13 18:17:07 +01:00
Claudio Atzori a3d0b57b25 [maven-release-plugin] prepare for next development iteration 2020-02-13 18:11:33 +01:00
Claudio Atzori 6ed9a15bc8 [maven-release-plugin] prepare release dhp-1.1.5 2020-02-13 18:11:31 +01:00
Claudio Atzori 49e648f7c3 bumped version 2020-02-13 18:09:31 +01:00
Claudio Atzori f9fae97e09 test json files aligned with the latest model changes 2020-02-13 18:05:59 +01:00
Claudio Atzori 1fee6e2b7e implemented XML records construction and serialization, indexing WIP 2020-02-13 16:53:27 +01:00
Michele Artini 80cb52593f bug fixing 2020-02-13 15:34:13 +01:00
Michele Artini cdea0dae75 bug fixing 2020-02-12 16:34:00 +01:00
Michele Artini 69336195d3 simplifications 2020-02-12 11:12:38 +01:00
Michele Artini 06c2fd6df9 bug fixing 2020-02-11 15:29:50 +01:00
Michele Artini 5fc09b179c bug fixing 2020-02-11 12:48:03 +01:00
Michele Artini 95740767e0 Ready for tests 2020-02-10 16:04:06 +01:00
Michele Artini 181e8498d4 ... 2020-02-07 16:02:49 +01:00
Przemysław Jacewicz 86b60268bb actionmanager implementation prototyping 2020-02-06 19:14:41 +01:00
Michele Artini bb1533a07e partial commit 2020-02-05 15:35:40 +01:00
Michele Artini fbb0fc140b partial implementation of migration 2020-02-04 15:25:47 +01:00
Claudio Atzori 7ba0f44d05 WIP 2020-01-30 18:21:07 +01:00
Claudio Atzori 49ef2f4eb1 removed input parameter specification, SparkXmlRecordBuilderJob doesn't need hive 2020-01-30 18:20:26 +01:00
Claudio Atzori b5e1e2e5b2 reintegrated changes from fcbc4ccd70 2020-01-30 18:11:04 +01:00
Claudio Atzori 7bacd6812e Merge branch 'provision_indexing' of https://code-repo.d4science.org/D-Net/dnet-hadoop into HEAD
 Conflicts:
	dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/GraphJoiner.java
	dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/MappingUtils.java
	dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/RelatedEntity.java
	dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/SparkXmlRecordBuilderJob.java
2020-01-30 17:59:46 +01:00
Claudio Atzori b2691a3b0a save adjacency list as JoinedEntity 2020-01-30 17:46:29 +01:00
Claudio Atzori 8c2aff99b0 joining entities using T x R x S, WIP: last representation based on LinkedEntity type 2020-01-29 15:40:33 +01:00
Sandro La Bruzzo 19a80e4638 implemented workflow for aggregation and generation of infospace graph 2020-01-24 09:58:55 +01:00
Claudio Atzori fcbc4ccd70 a bit of docs doesn't hurt 2020-01-24 08:43:23 +01:00
Claudio Atzori a55f5fecc6 joining entities using T x R x S method with groupByKey, WIP: making target objects (T) have lower memory footprint 2020-01-24 08:17:53 +01:00
Michele Artini 6bfe2dc96e partial implementation 2020-01-22 16:00:23 +01:00
Claudio Atzori 799929c1e3 joining entities using T x R x S method with groupByKey 2020-01-21 16:35:44 +01:00
Michele Artini f6eccdde33 partial implementation 2020-01-21 14:17:05 +01:00
Michele Artini cd114f1c3b partial update 2020-01-21 12:32:10 +01:00
Michele Artini b35c59eb42 partial implementation of entities from db 2020-01-20 16:04:19 +01:00
Michele Artini 81f82b5d34 partial implementation of applications to migrate entities 2020-01-17 15:26:21 +01:00
Claudio Atzori 1cd6899480 merged from master 2020-01-17 14:25:57 +01:00
Claudio Atzori 97c239ee0d WIP: trying to find a way to build the records for the index 2020-01-16 12:02:28 +02:00
miconis 4955be0197 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-01-14 15:03:44 +02:00
miconis f61adfc2bb minor changes 2020-01-14 15:03:27 +02:00
miconis 9bdcb02179 minor changes and update of the configuration for publications 2020-01-14 15:01:03 +02:00
Michele Artini f7b9a7a9af entity migration (partial implementation) 2020-01-10 15:55:23 +01:00
Michele Artini 7229fecbcf fix warnings in poms 2019-12-20 13:41:08 +01:00
Sandro La Bruzzo dd21db7036 fixed stuff 2019-12-18 16:28:22 +01:00
Claudio Atzori 7ba586d2e5 oozie workflow aimed to build the adjacency lists representation of the graph, needed to build the records to be indexed 2019-12-17 16:24:49 +01:00
Sandro La Bruzzo 76efcde4fd using new branch decisionTreeDedup 2019-12-13 12:20:35 +01:00
Sandro La Bruzzo b4392f9f43 implemented DedupRecord factory for missing entities 2019-12-13 09:40:02 +01:00
Sandro La Bruzzo 39367676d7 implemented DedupRecord factory with the merge of project 2019-12-12 15:18:48 +01:00
Sandro La Bruzzo 6b45e37e22 implemented DedupRecord factory with the merge of organizations 2019-12-11 16:57:37 +01:00
Sandro La Bruzzo abd9034da0 implemented DedupRecord factory with the merge of publications 2019-12-11 15:43:24 +01:00
miconis 4b66b471a4 implementation of the sorting by trust mechanism and the merge of oaf entities 2019-12-10 14:57:16 +01:00
Sandro La Bruzzo cc63706347 Implemented deduplication on spark 2019-12-06 13:38:00 +01:00
Sandro La Bruzzo aad0cb40b7 Added schema Scholexplorer 2019-11-14 10:34:09 +01:00
Claudio Atzori 5711e75f67 use ${project.version} whenever possible 2019-11-08 17:41:51 +01:00
Claudio Atzori 245b4cbbb3 removed import limit 2019-11-08 17:41:01 +01:00
Claudio Atzori 7fe6835b47 [maven-release-plugin] prepare for next development iteration 2019-11-07 17:39:30 +01:00
Claudio Atzori 58918967d9 [maven-release-plugin] prepare release dhp-1.0.4 2019-11-07 17:39:27 +01:00
Claudio Atzori 5308f05a02 allow to specify the target hive DB name in the infospace import workflow 2019-11-07 17:38:09 +01:00
Claudio Atzori a52d5bde4f simplified import procedure, maps the infospace as hive tables 2019-11-06 17:45:52 +01:00
Claudio Atzori 1e7a2ac41d align parameter names, graph import procedure WIP 2019-11-04 17:41:01 +01:00
Claudio Atzori f39148dab8 [maven-release-plugin] prepare for next development iteration 2019-11-04 12:34:48 +01:00
Claudio Atzori 34b0e7b40a [maven-release-plugin] prepare release dhp-1.0.3 2019-11-04 12:34:46 +01:00
Claudio Atzori 439ad80d81 conversion utilities from protobuffer model to DHP model moved in dnet-mapreduce-jobs. Removed also the relative protobuf dependencies 2019-11-04 12:33:23 +01:00
Claudio Atzori 32ed4ae8d6 conversion utilities from protobuffer model to DHP model moved in dnet-mapreduce-jobs. Removed also the relative protobuf dependencies 2019-11-04 12:28:56 +01:00
Sandro La Bruzzo fd0ad82111 [maven-release-plugin] prepare for next development iteration 2019-10-31 12:08:51 +01:00
Sandro La Bruzzo f224613b40 [maven-release-plugin] prepare release dhp-1.0.2 2019-10-31 12:08:49 +01:00
Sandro La Bruzzo e13c30cc96 [maven-release-plugin] rollback the release of dhp-1.0.2 2019-10-31 12:07:04 +01:00
Sandro La Bruzzo 4da5239203 [maven-release-plugin] prepare release dhp-1.0.2 2019-10-31 12:06:14 +01:00
Sandro La Bruzzo db8b346edd [maven-release-plugin] rollback the release of 1.0.1 2019-10-31 11:49:05 +01:00
Sandro La Bruzzo fc80052173 [maven-release-plugin] prepare for next development iteration 2019-10-31 11:47:42 +01:00
Sandro La Bruzzo 3150c7ce6d [maven-release-plugin] prepare release 1.0.1 2019-10-31 11:47:40 +01:00
Sandro La Bruzzo 18ec8e8147 moved protoutils function to dhp-schemas 2019-10-31 11:31:37 +01:00
Sandro La Bruzzo 997e57d45b Added entity filter to spark class 2019-10-30 12:19:03 +01:00
Sandro La Bruzzo a336956708 added default property to job 2019-10-30 12:01:42 +01:00
Claudio Atzori 78b5b57e86 trying to make the spark action to be run as spark2 2019-10-29 18:56:34 +01:00
Claudio Atzori c8bb81cd9a align dependencies with IIS cluster 2019-10-29 18:10:20 +01:00
Sandro La Bruzzo fe62ccd6dd implemented oozie wf 2019-10-28 12:12:50 +01:00
Sandro La Bruzzo 9ee4e5a196 remove a bit of syntactic sugar on the object inheritance :( 2019-10-25 18:10:30 +02:00
Sandro La Bruzzo c74335ebc7 resolved conflict 2019-10-25 14:34:50 +02:00
Sandro La Bruzzo 8c902c500a minor fix 2019-10-25 14:33:54 +02:00
miconis 9fa5aebe9c minor changes 2019-10-25 12:52:28 +02:00
miconis 551eda1600 dataset, orp and software mapping implemented. addition of test resources for results. implementation of tests to check the result of the mapping 2019-10-25 12:48:25 +02:00
Sandro La Bruzzo eef14fade3 fixed conflict 2019-10-25 11:58:20 +02:00
Sandro La Bruzzo 0ea7e861ab added organizations test 2019-10-25 11:56:28 +02:00
miconis 4908165e05 implementation of the createPublication method to map publications 2019-10-25 11:54:14 +02:00
miconis df37bd6aaf placeholders for setters in createpublication 2019-10-25 10:57:19 +02:00
Sandro La Bruzzo c8d6d6bbd1 implemented organization mapping 2019-10-25 10:23:51 +02:00
miconis b525b54130 starting implementing the createPublication class 2019-10-25 09:55:31 +02:00
Claudio Atzori 4b331790e7 resolved conflicts 2019-10-25 09:45:12 +02:00
Claudio Atzori c929c1dfac more proto 2 graph model mappings 2019-10-25 09:25:36 +02:00
Sandro La Bruzzo 09ffda03a2 removed circular dependencies 2019-10-25 09:24:18 +02:00
Sandro La Bruzzo a10d071cf4 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2019-10-24 17:55:44 +02:00
Sandro La Bruzzo 3a8bb11695 mapped first part 2019-10-24 17:55:40 +02:00
Claudio Atzori d46371ceab Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2019-10-24 17:43:55 +02:00
Claudio Atzori 0d88f9a6a4 added mapping for projects 2019-10-24 17:43:42 +02:00
Sandro La Bruzzo 2dd9572f41 added Mapping of OriginalDescription 2019-10-24 17:36:44 +02:00
miconis 351d850ad3 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2019-10-24 17:29:07 +02:00
miconis b66a7e3030 publication test added 2019-10-24 17:29:01 +02:00
Sandro La Bruzzo 6c32d418ac added conversion of ExtraInfo 2019-10-24 17:26:55 +02:00
Claudio Atzori 5f339a2c24 added mappings for basic types 2019-10-24 17:21:45 +02:00
Sandro La Bruzzo 9d04111391 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2019-10-24 17:05:52 +02:00
Sandro La Bruzzo 0902bac7dd fixed conflict 2019-10-24 17:05:42 +02:00
Claudio Atzori d8bfaa3687 added mapping for relations 2019-10-24 17:04:13 +02:00
Sandro La Bruzzo d2965636e0 created test for convert json into new OAF data model 2019-10-24 17:02:35 +02:00
Claudio Atzori 79c4f1bbd8 Protobuf to internal graph model, early steps 2019-10-24 16:56:13 +02:00
Claudio Atzori d38aeb8c6e DataInfo.provenanceaction not repeatable, fluent setters 2019-10-24 16:55:38 +02:00
Sandro La Bruzzo 5744a64478 added module dhp-graph-mapper 2019-10-24 16:00:28 +02:00
Sandro La Bruzzo 5a8a323f2a dhp-collection-worker integrated in dhp-workflows 2019-10-24 11:36:59 +02:00
Claudio Atzori dd1d6fcb01 moved libs in main pom file 2019-10-18 10:50:55 +02:00
Claudio Atzori 176a13601b commented out maven plugin for integration tests 2019-10-18 10:50:32 +02:00
Claudio Atzori 0c284e0a51 doc 2019-10-18 09:42:41 +02:00
Claudio Atzori c7654b6fe3 renamed collection & transformation oozie workflow files 2019-10-18 09:42:20 +02:00
Claudio Atzori 44d7e85797 imported oozie-installer.markdown docs from https://github.com/openaire/iis/blob/master/iis-wf/docs/oozie-installer.markdown 2019-10-17 18:43:43 +02:00
Claudio Atzori 27db5afdad integrating the oozie workflow build/deploy/run mechanism, took inspiration from iis 2019-10-17 18:38:30 +02:00
Sandro La Bruzzo bbb87d0e3d implemented saxonHE on transformation spark job 2019-10-10 11:33:51 +02:00
Sandro La Bruzzo 4b8c7c279d Added documentation on a class, and reused ArgumetApplicationParser on dhp-aggregation 2019-10-07 17:02:53 +02:00
Sandro La Bruzzo 53ec9bccca changed the implementation of RabbitMQ communication 2019-04-16 12:28:01 +02:00
Sandro La Bruzzo 403c13eebf Implemented message manager, Fixed bug on collection worker, implemented Collection and Transform spark job 2019-04-11 15:39:29 +02:00
Sandro La Bruzzo ded6aef5e1 moved collector worker 2019-04-03 16:05:16 +02:00
Sandro La Bruzzo c2ecbf5572 moved collector worker 2019-04-03 16:03:36 +02:00
enricoottonello b316467608 added common module 2019-04-03 10:53:54 +02:00
Sandro La Bruzzo 12c65eab4c implemented command line 2019-03-25 15:18:31 +01:00
Sandro La Bruzzo 6156562893 Added test 2019-03-18 10:47:28 +01:00
Sandro La Bruzzo e67d9ee1a9 added first implementation of dnet-workflows 2019-03-18 10:44:35 +01:00