Commit Graph

38 Commits

Author SHA1 Message Date
Claudio Atzori e355961997 dataset based provision WIP 2020-04-06 17:34:25 +02:00
Claudio Atzori ca345aaad3 dataset based provision WIP 2020-04-06 15:33:31 +02:00
Claudio Atzori c8f4b95464 dataset based provision WIP 2020-04-06 08:59:58 +02:00
Claudio Atzori eb2f5f3198 dataset based provision WIP 2020-04-04 17:41:31 +02:00
Claudio Atzori 3d1b637cab dataset based provision WIP 2020-04-04 14:03:43 +02:00
Claudio Atzori 24b2c9012e dataset based provision WIP 2020-04-02 18:44:09 +02:00
Claudio Atzori daa26acc9d dataset based provision WIP, fixed spark2EventLogDir 2020-04-02 16:15:50 +02:00
Claudio Atzori 9c7092416a dataset based provision WIP 2020-04-01 19:07:30 +02:00
Claudio Atzori 1402eb1fe7 cleanup 2020-04-01 15:38:50 +02:00
Claudio Atzori adcdd2d05e WIP: reimplementing the adjacency list construction process using spark Datasets 2020-04-01 14:56:57 +02:00
Claudio Atzori 0fbec69b82 use oozie prepare statement to cleanup working directories 2020-03-30 19:48:41 +02:00
Claudio Atzori f3f9affd49 allow dynamic executors to build XML records 2020-03-30 13:12:11 +02:00
Claudio Atzori 2e2d4c4c68 adjusted path to template resource 2020-03-30 13:11:49 +02:00
Claudio Atzori 673e744649 moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa 2020-03-27 10:42:17 +01:00
Claudio Atzori abe8fb69a2 added global properties, moved postprocessing script inside the oozie_app directory 2020-03-18 15:43:54 +01:00
Claudio Atzori aeb01fa353 reading from newline delimited json textfiles instead of sequence files 2020-03-17 11:57:24 +01:00
Claudio Atzori a3f184fd3f added field websiteurl in related organizations 2020-03-10 17:08:58 +01:00
Claudio Atzori 0e95544495 fixed serialization for datasource subjects 2020-03-10 17:07:44 +01:00
Claudio Atzori 5e342a555c no need to compute the inverse relClass, fixed text() in xpath expressions 2020-03-05 12:51:48 +01:00
Claudio Atzori 6ec04d4e02 specified column used to perform the join operation in the javadoc 2020-03-05 12:50:38 +01:00
Claudio Atzori 1e563bc15e introduced distinct properties driving the resouce usage for the XML record creation and the indexing phase 2020-03-04 10:55:11 +01:00
Claudio Atzori bc7cfd5975 indexing workflow WIP: fixed projects fundingtree xml conversion, prioritized links between results and projects when limiting them to 100 in the join procedure 2020-03-02 17:03:07 +01:00
Claudio Atzori 60bc2b1a20 drop the hive DB before populating it from scratch 2020-02-27 10:10:55 +01:00
Claudio Atzori 6a73fd5da5 in order to reuse the same XmlRecordFactory across different tasks, the state of contexts must be one per record built 2020-02-21 09:17:19 +01:00
Claudio Atzori 33185fd0b7 ISLookupClientFactory moved in dhp-common 2020-02-19 16:56:38 +01:00
Claudio Atzori 56d1810a66 working procedure for records indexing using Spark, via lib com.lucidworks.spark:spark-solr 2020-02-14 12:28:52 +01:00
Claudio Atzori 1fee6e2b7e implemented XML records construction and serialization, indexing WIP 2020-02-13 16:53:27 +01:00
Claudio Atzori 7ba0f44d05 WIP 2020-01-30 18:21:07 +01:00
Claudio Atzori 49ef2f4eb1 removed input parameter specification, SparkXmlRecordBuilderJob doesn't need hive 2020-01-30 18:20:26 +01:00
Claudio Atzori b5e1e2e5b2 reintegrated changes from fcbc4ccd70 2020-01-30 18:11:04 +01:00
Claudio Atzori 7bacd6812e Merge branch 'provision_indexing' of https://code-repo.d4science.org/D-Net/dnet-hadoop into HEAD
 Conflicts:
	dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/GraphJoiner.java
	dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/MappingUtils.java
	dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/RelatedEntity.java
	dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/SparkXmlRecordBuilderJob.java
2020-01-30 17:59:46 +01:00
Claudio Atzori b2691a3b0a save adjacency list as JoinedEntity 2020-01-30 17:46:29 +01:00
Claudio Atzori 8c2aff99b0 joining entities using T x R x S, WIP: last representation based on LinkedEntity type 2020-01-29 15:40:33 +01:00
Claudio Atzori fcbc4ccd70 a bit of docs doesn't hurt 2020-01-24 08:43:23 +01:00
Claudio Atzori a55f5fecc6 joining entities using T x R x S method with groupByKey, WIP: making target objects (T) have lower memory footprint 2020-01-24 08:17:53 +01:00
Claudio Atzori 799929c1e3 joining entities using T x R x S method with groupByKey 2020-01-21 16:35:44 +01:00
Claudio Atzori 97c239ee0d WIP: trying to find a way to build the records for the index 2020-01-16 12:02:28 +02:00
Claudio Atzori 7ba586d2e5 oozie workflow aimed to build the adjacency lists representation of the graph, needed to build the records to be indexed 2019-12-17 16:24:49 +01:00