274 Commits

Author SHA1 Message Date
Sandro La Bruzzo addaaa091f migrate relation from RDD to Dataset 2020-03-13 09:13:20 +01:00
Claudio Atzori 7b6f0c8756 reading graph dump as text files, encoded as newline-delimited JSON records, as indicated in the wiki 2020-03-10 17:19:17 +01:00
Claudio Atzori 60aedb1110 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-10 17:09:44 +01:00
Claudio Atzori a3f184fd3f added field websiteurl in related organizations 2020-03-10 17:08:58 +01:00
Claudio Atzori 0e95544495 fixed serialization for datasource subjects 2020-03-10 17:07:44 +01:00
Sandro La Bruzzo 7b28783fb4 updated unpaywall mapping 2020-03-08 17:00:19 +01:00
Michele Artini b6efa9d6ab Configuration of the SequenceFile Writer 2020-03-05 15:49:14 +01:00
Claudio Atzori 5e342a555c no need to compute the inverse relClass, fixed text() in xpath expressions 2020-03-05 12:51:48 +01:00
Claudio Atzori 6ec04d4e02 specified column used to perform the join operation in the javadoc 2020-03-05 12:50:38 +01:00
Michele Artini 7a2a466161 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-04 14:50:59 +01:00
Michele Artini 755eade2fb fix creation ids 2020-03-04 14:49:45 +01:00
Claudio Atzori 6379f32466 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-04 10:57:06 +01:00
Claudio Atzori 0233987603 introduced post processing step following the hive DB creation/population 2020-03-04 10:56:50 +01:00
Claudio Atzori 1e563bc15e introduced distinct properties driving the resource usage for the XML record creation and the indexing phase 2020-03-04 10:55:11 +01:00
Claudio Atzori 9af3e904be close the SparkSession at the end 2020-03-04 10:53:31 +01:00
Michele Artini e7167b996a logs and closeable 2020-03-04 10:46:36 +01:00
Claudio Atzori 25ceec29ab code formatting 2020-03-04 10:44:24 +01:00
Claudio Atzori 63c00c5e88 fixed typo 2020-03-04 10:43:44 +01:00
Claudio Atzori 9cf5ce2e66 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-02 17:03:10 +01:00
Claudio Atzori bc7cfd5975 indexing workflow WIP: fixed projects fundingtree xml conversion, prioritized links between results and projects when limiting them to 100 in the join procedure 2020-03-02 17:03:07 +01:00
Michele Artini 4b29a121b0 migration using spark in step2 2020-03-02 16:12:14 +01:00
Michele Artini 5445a57102 migration using spark in step2 2020-03-02 16:11:59 +01:00
Sandro La Bruzzo b32655e48e changed code to save intermediate result 2020-02-27 10:18:46 +01:00
Claudio Atzori 60bc2b1a20 drop the hive DB before populating it from scratch 2020-02-27 10:10:55 +01:00
Sandro La Bruzzo f09e065865 incremented number of repartition 2020-02-26 19:26:19 +01:00
Sandro La Bruzzo 071f5c3e52 fixed NPE 2020-02-26 15:42:20 +01:00
Sandro La Bruzzo a1a6fc8315 fixed NPE 2020-02-26 15:42:13 +01:00
Sandro La Bruzzo 1edf02a3ce added log 2020-02-26 15:25:03 +01:00
Sandro La Bruzzo c3ecabd8e8 fixed NPE 2020-02-26 14:40:02 +01:00
Sandro La Bruzzo 5d0f46651b fixed NPE 2020-02-26 14:31:34 +01:00
Sandro La Bruzzo bc342bf73a fixed wrong generation type in summary 2020-02-26 12:49:47 +01:00
Sandro La Bruzzo 3112e21858 fixed typo 2020-02-26 12:22:43 +01:00
Sandro La Bruzzo 119ae6eef5 fixed wrong loop in the workflow 2020-02-26 12:18:50 +01:00
Sandro La Bruzzo 7936583a3d added generation of Scholix collection 2020-02-26 12:09:06 +01:00
Sandro La Bruzzo 2ef3705b2c Added Provision workflow 2020-02-26 10:51:35 +01:00
Michele Artini 689908b2e9 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-02-25 16:00:51 +01:00
Michele Artini 93665773ea Fixed a problem with JavaRDD Union 2020-02-25 15:59:21 +01:00
Sandro La Bruzzo b021b8a2e1 Added index wf 2020-02-24 10:15:55 +01:00
Claudio Atzori 6a73fd5da5 in order to reuse the same XmlRecordFactory across different tasks, the state of contexts must be one per record built 2020-02-21 09:17:19 +01:00
Michele Artini d49cd2fdc6 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-02-20 11:21:54 +01:00
Claudio Atzori 5e5e32cb48 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-02-19 16:56:52 +01:00
Claudio Atzori 33185fd0b7 ISLookupClientFactory moved in dhp-common 2020-02-19 16:56:38 +01:00
Michele Artini 5d3739b5cf migration of claims 2020-02-19 15:11:17 +01:00
Michele Artini 173f1df1e5 saved a query for openaire production database 2020-02-19 10:15:08 +01:00
Sandro La Bruzzo 9a2d74ac82 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-02-19 10:13:45 +01:00
Sandro La Bruzzo e5d7cdf422 fixed sql query 2020-02-19 10:13:36 +01:00
Sandro La Bruzzo 2b8675462f refactoring code 2020-02-19 10:07:08 +01:00
Claudio Atzori ed76521d9b removed stale test resources, will be re-added later on 2020-02-18 11:51:08 +01:00
Claudio Atzori 0f364605ff removed stale tests, need to reimplement them anyway 2020-02-18 11:48:19 +01:00
Claudio Atzori 6a288625e5 fixed workflow outgoing node 2020-02-17 15:04:33 +01:00
Claudio Atzori 1b18fd4d54 sync with master branch 2020-02-17 13:49:46 +01:00
Sandro La Bruzzo 4f04759738 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-02-17 12:31:58 +01:00
Sandro La Bruzzo 76ee85141a added oozie job for DNET migration and implemented Spark job for extracting entities 2020-02-17 12:31:44 +01:00
Claudio Atzori c460e2d281 Update 'dhp-workflows/docs/oozie-installer.markdown' 2020-02-17 11:54:48 +01:00
Michele Artini 176c5606bd aligned with origin/master, aligned model and mapping 2020-02-17 10:40:53 +01:00
Claudio Atzori 56d1810a66 working procedure for records indexing using Spark, via lib com.lucidworks.spark:spark-solr 2020-02-14 12:28:52 +01:00
Claudio Atzori 1ee1baa8c0 Merge branch 'master' into provision_indexing 2020-02-13 18:17:07 +01:00
Claudio Atzori a3d0b57b25 [maven-release-plugin] prepare for next development iteration 2020-02-13 18:11:33 +01:00
Claudio Atzori 6ed9a15bc8 [maven-release-plugin] prepare release dhp-1.1.5 2020-02-13 18:11:31 +01:00
Claudio Atzori 49e648f7c3 bumped version 2020-02-13 18:09:31 +01:00
Claudio Atzori f9fae97e09 test json files aligned with the latest model changes 2020-02-13 18:05:59 +01:00
Claudio Atzori 1fee6e2b7e implemented XML records construction and serialization, indexing WIP 2020-02-13 16:53:27 +01:00
Michele Artini 80cb52593f bug fixing 2020-02-13 15:34:13 +01:00
Michele Artini cdea0dae75 bug fixing 2020-02-12 16:34:00 +01:00
Michele Artini 69336195d3 simplifications 2020-02-12 11:12:38 +01:00
Michele Artini 06c2fd6df9 bug fixing 2020-02-11 15:29:50 +01:00
Michele Artini 5fc09b179c bug fixing 2020-02-11 12:48:03 +01:00
Michele Artini 95740767e0 Ready for tests 2020-02-10 16:04:06 +01:00
Michele Artini 181e8498d4 ... 2020-02-07 16:02:49 +01:00
Michele Artini bb1533a07e partial commit 2020-02-05 15:35:40 +01:00
Michele Artini fbb0fc140b partial implementation of migration 2020-02-04 15:25:47 +01:00
Claudio Atzori 7ba0f44d05 WIP 2020-01-30 18:21:07 +01:00
Claudio Atzori 49ef2f4eb1 removed input parameter specification, SparkXmlRecordBuilderJob doesn't need hive 2020-01-30 18:20:26 +01:00
Claudio Atzori b5e1e2e5b2 reintegrated changes from fcbc4ccd70 2020-01-30 18:11:04 +01:00
Claudio Atzori 7bacd6812e Merge branch 'provision_indexing' of https://code-repo.d4science.org/D-Net/dnet-hadoop into HEAD. Conflicts in dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/: GraphJoiner.java, MappingUtils.java, RelatedEntity.java, SparkXmlRecordBuilderJob.java 2020-01-30 17:59:46 +01:00
Claudio Atzori b2691a3b0a save adjacency list as JoinedEntity 2020-01-30 17:46:29 +01:00
Claudio Atzori 8c2aff99b0 joining entities using T x R x S, WIP: last representation based on LinkedEntity type 2020-01-29 15:40:33 +01:00
Sandro La Bruzzo 19a80e4638 implemented workflow for aggregation and generation of infospace graph 2020-01-24 09:58:55 +01:00
Claudio Atzori fcbc4ccd70 a bit of docs doesn't hurt 2020-01-24 08:43:23 +01:00
Claudio Atzori a55f5fecc6 joining entities using T x R x S method with groupByKey, WIP: making target objects (T) have lower memory footprint 2020-01-24 08:17:53 +01:00
Michele Artini 6bfe2dc96e partial implementation 2020-01-22 16:00:23 +01:00
Claudio Atzori 799929c1e3 joining entities using T x R x S method with groupByKey 2020-01-21 16:35:44 +01:00
Michele Artini f6eccdde33 partial implementation 2020-01-21 14:17:05 +01:00
Michele Artini cd114f1c3b partial update 2020-01-21 12:32:10 +01:00
Michele Artini b35c59eb42 partial implementation of entities from db 2020-01-20 16:04:19 +01:00
Michele Artini 81f82b5d34 partial implementation of applications to migrate entities 2020-01-17 15:26:21 +01:00
Claudio Atzori 1cd6899480 merged from master 2020-01-17 14:25:57 +01:00
Claudio Atzori 97c239ee0d WIP: trying to find a way to build the records for the index 2020-01-16 12:02:28 +02:00
miconis 4955be0197 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-01-14 15:03:44 +02:00
miconis f61adfc2bb minor changes 2020-01-14 15:03:27 +02:00
miconis 9bdcb02179 minor changes and update of the configuration for publications 2020-01-14 15:01:03 +02:00
Michele Artini f7b9a7a9af entity migration (partial implementation) 2020-01-10 15:55:23 +01:00
Michele Artini 7229fecbcf fix warnings in poms 2019-12-20 13:41:08 +01:00
Sandro La Bruzzo dd21db7036 fixed stuff 2019-12-18 16:28:22 +01:00
Claudio Atzori 7ba586d2e5 oozie workflow aimed to build the adjacency lists representation of the graph, needed to build the records to be indexed 2019-12-17 16:24:49 +01:00
Sandro La Bruzzo 76efcde4fd using new branch decisionTreeDedup 2019-12-13 12:20:35 +01:00
Sandro La Bruzzo b4392f9f43 implemented DedupRecord factory for missing entities 2019-12-13 09:40:02 +01:00
Sandro La Bruzzo 39367676d7 implemented DedupRecord factory with the merge of project 2019-12-12 15:18:48 +01:00
Sandro La Bruzzo 6b45e37e22 implemented DedupRecord factory with the merge of organizations 2019-12-11 16:57:37 +01:00
Sandro La Bruzzo abd9034da0 implemented DedupRecord factory with the merge of publications 2019-12-11 15:43:24 +01:00
miconis 4b66b471a4 implementation of the sorting by trust mechanism and the merge of oaf entities 2019-12-10 14:57:16 +01:00
Sandro La Bruzzo cc63706347 Implemented deduplication on spark 2019-12-06 13:38:00 +01:00
Sandro La Bruzzo aad0cb40b7 Added schema Scholexplorer 2019-11-14 10:34:09 +01:00
Claudio Atzori 5711e75f67 use ${project.version} whenever possible 2019-11-08 17:41:51 +01:00
Claudio Atzori 245b4cbbb3 removed import limit 2019-11-08 17:41:01 +01:00
Claudio Atzori 7fe6835b47 [maven-release-plugin] prepare for next development iteration 2019-11-07 17:39:30 +01:00
Claudio Atzori 58918967d9 [maven-release-plugin] prepare release dhp-1.0.4 2019-11-07 17:39:27 +01:00
Claudio Atzori 5308f05a02 allow to specify the target hive DB name in the infospace import workflow 2019-11-07 17:38:09 +01:00
Claudio Atzori a52d5bde4f simplified import procedure, maps the infospace as hive tables 2019-11-06 17:45:52 +01:00
Claudio Atzori 1e7a2ac41d align parameter names, graph import procedure WIP 2019-11-04 17:41:01 +01:00
Claudio Atzori f39148dab8 [maven-release-plugin] prepare for next development iteration 2019-11-04 12:34:48 +01:00
Claudio Atzori 34b0e7b40a [maven-release-plugin] prepare release dhp-1.0.3 2019-11-04 12:34:46 +01:00
Claudio Atzori 439ad80d81 conversion utilities from protobuffer model to DHP model moved in dnet-mapreduce-jobs. Removed also the relative protobuf dependencies 2019-11-04 12:33:23 +01:00
Claudio Atzori 32ed4ae8d6 conversion utilities from protobuffer model to DHP model moved in dnet-mapreduce-jobs. Removed also the relative protobuf dependencies 2019-11-04 12:28:56 +01:00
Sandro La Bruzzo fd0ad82111 [maven-release-plugin] prepare for next development iteration 2019-10-31 12:08:51 +01:00
Sandro La Bruzzo f224613b40 [maven-release-plugin] prepare release dhp-1.0.2 2019-10-31 12:08:49 +01:00
Sandro La Bruzzo e13c30cc96 [maven-release-plugin] rollback the release of dhp-1.0.2 2019-10-31 12:07:04 +01:00
Sandro La Bruzzo 4da5239203 [maven-release-plugin] prepare release dhp-1.0.2 2019-10-31 12:06:14 +01:00
Sandro La Bruzzo db8b346edd [maven-release-plugin] rollback the release of 1.0.1 2019-10-31 11:49:05 +01:00
Sandro La Bruzzo fc80052173 [maven-release-plugin] prepare for next development iteration 2019-10-31 11:47:42 +01:00
Sandro La Bruzzo 3150c7ce6d [maven-release-plugin] prepare release 1.0.1 2019-10-31 11:47:40 +01:00
Sandro La Bruzzo 18ec8e8147 moved protoutils function to dhp-schemas 2019-10-31 11:31:37 +01:00
Sandro La Bruzzo 997e57d45b Added entity filter to spark class 2019-10-30 12:19:03 +01:00
Sandro La Bruzzo a336956708 added default property to job 2019-10-30 12:01:42 +01:00
Claudio Atzori 78b5b57e86 trying to make the spark action to be run as spark2 2019-10-29 18:56:34 +01:00
Claudio Atzori c8bb81cd9a align dependencies with IIS cluster 2019-10-29 18:10:20 +01:00
Sandro La Bruzzo fe62ccd6dd implemented oozie wf 2019-10-28 12:12:50 +01:00
Sandro La Bruzzo 9ee4e5a196 remove a bit of syntactic sugar on the object inheritance :( 2019-10-25 18:10:30 +02:00
Sandro La Bruzzo c74335ebc7 resolved conflict 2019-10-25 14:34:50 +02:00
Sandro La Bruzzo 8c902c500a minor fix 2019-10-25 14:33:54 +02:00
miconis 9fa5aebe9c minor changes 2019-10-25 12:52:28 +02:00
miconis 551eda1600 dataset, orp and software mapping implemented. addition of test resources for results. implementation of tests to check the result of the mapping 2019-10-25 12:48:25 +02:00
Sandro La Bruzzo eef14fade3 fixed conflict 2019-10-25 11:58:20 +02:00
Sandro La Bruzzo 0ea7e861ab added organizations test 2019-10-25 11:56:28 +02:00
miconis 4908165e05 implementation of the createPublication method to map publications 2019-10-25 11:54:14 +02:00
miconis df37bd6aaf placeholders for setters in createpublication 2019-10-25 10:57:19 +02:00
Sandro La Bruzzo c8d6d6bbd1 implemented organization mapping 2019-10-25 10:23:51 +02:00
miconis b525b54130 starting implementing the createPublication class 2019-10-25 09:55:31 +02:00
Claudio Atzori 4b331790e7 resolved conflicts 2019-10-25 09:45:12 +02:00
Claudio Atzori c929c1dfac more proto 2 graph model mappings 2019-10-25 09:25:36 +02:00
Sandro La Bruzzo 09ffda03a2 removed circular dependencies 2019-10-25 09:24:18 +02:00
Sandro La Bruzzo a10d071cf4 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2019-10-24 17:55:44 +02:00
Sandro La Bruzzo 3a8bb11695 mapped first part 2019-10-24 17:55:40 +02:00
Claudio Atzori d46371ceab Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2019-10-24 17:43:55 +02:00
Claudio Atzori 0d88f9a6a4 added mapping for projects 2019-10-24 17:43:42 +02:00
Sandro La Bruzzo 2dd9572f41 added Mapping of OriginalDescription 2019-10-24 17:36:44 +02:00
miconis 351d850ad3 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2019-10-24 17:29:07 +02:00
miconis b66a7e3030 publication test added 2019-10-24 17:29:01 +02:00
Sandro La Bruzzo 6c32d418ac added conversion of ExtraInfo 2019-10-24 17:26:55 +02:00
Claudio Atzori 5f339a2c24 added mappings for basic types 2019-10-24 17:21:45 +02:00