Claudio Atzori
|
5dea155a87
|
increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase
|
2020-05-28 13:49:59 +02:00 |
Claudio Atzori
|
cfd753217c
|
repartition the join_entities in 24k files
|
2020-05-27 12:44:01 +02:00 |
Claudio Atzori
|
4e36d689dd
|
fixed XML serialization for children sub-elements (duplicates & externalreferences)
|
2020-05-26 18:30:40 +02:00 |
Claudio Atzori
|
b33dd58be4
|
replaced parameter 'reuseRecords' with 'resumeFrom', which allows restarting the provision workflow execution from any step; useful for manual submissions or debugging
|
2020-05-22 08:50:06 +02:00 |
Claudio Atzori
|
dbfb9c19fe
|
minor changes
|
2020-05-21 10:00:14 +02:00 |
Claudio Atzori
|
d7d2a0637f
|
added extra parameters to the provision indexing workflow
|
2020-05-20 14:55:38 +02:00 |
Claudio Atzori
|
0bdfbb0a57
|
reintroduced RDD based relation cut off procedure
|
2020-05-19 15:02:21 +02:00 |
Claudio Atzori
|
8c67073a07
|
force speculative execution to false
|
2020-05-08 09:42:21 +02:00 |
Claudio Atzori
|
bac37b3973
|
fixed children expansion in XML records
|
2020-05-04 11:51:17 +02:00 |
Claudio Atzori
|
0b55795d4d
|
small adjustments in the provisioning workflow
|
2020-04-21 16:15:04 +02:00 |
Claudio Atzori
|
77f59b1b10
|
dataset based provision WIP
|
2020-04-06 19:37:27 +02:00 |
Claudio Atzori
|
ca345aaad3
|
dataset based provision WIP
|
2020-04-06 15:33:31 +02:00 |
Claudio Atzori
|
c8f4b95464
|
dataset based provision WIP
|
2020-04-06 08:59:58 +02:00 |
Claudio Atzori
|
eb2f5f3198
|
dataset based provision WIP
|
2020-04-04 17:41:31 +02:00 |
Claudio Atzori
|
3d1b637cab
|
dataset based provision WIP
|
2020-04-04 14:03:43 +02:00 |
Claudio Atzori
|
24b2c9012e
|
dataset based provision WIP
|
2020-04-02 18:44:09 +02:00 |
Claudio Atzori
|
daa26acc9d
|
dataset based provision WIP, fixed spark2EventLogDir
|
2020-04-02 16:15:50 +02:00 |
Claudio Atzori
|
9c7092416a
|
dataset based provision WIP
|
2020-04-01 19:07:30 +02:00 |
Claudio Atzori
|
adcdd2d05e
|
WIP: reimplementing the adjacency list construction process using spark Datasets
|
2020-04-01 14:56:57 +02:00 |
Claudio Atzori
|
0fbec69b82
|
use oozie prepare statement to cleanup working directories
|
2020-03-30 19:48:41 +02:00 |
Claudio Atzori
|
f3f9affd49
|
allow dynamic executors to build XML records
|
2020-03-30 13:12:11 +02:00 |
Claudio Atzori
|
673e744649
|
moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa
|
2020-03-27 10:42:17 +01:00 |
Claudio Atzori
|
abe8fb69a2
|
added global properties, moved postprocessing script inside the oozie_app directory
|
2020-03-18 15:43:54 +01:00 |
Claudio Atzori
|
1e563bc15e
|
introduced distinct properties driving the resource usage for the XML record creation and the indexing phase
|
2020-03-04 10:55:11 +01:00 |
Claudio Atzori
|
bc7cfd5975
|
indexing workflow WIP: fixed projects fundingtree xml conversion, prioritized links between results and projects when limiting them to 100 in the join procedure
|
2020-03-02 17:03:07 +01:00 |
Claudio Atzori
|
56d1810a66
|
working procedure for records indexing using Spark, via lib com.lucidworks.spark:spark-solr
|
2020-02-14 12:28:52 +01:00 |
Claudio Atzori
|
1fee6e2b7e
|
implemented XML records construction and serialization, indexing WIP
|
2020-02-13 16:53:27 +01:00 |
Claudio Atzori
|
49ef2f4eb1
|
removed input parameter specification, SparkXmlRecordBuilderJob doesn't need hive
|
2020-01-30 18:20:26 +01:00 |
Claudio Atzori
|
b2691a3b0a
|
save adjacency list as JoinedEntity
|
2020-01-30 17:46:29 +01:00 |
Claudio Atzori
|
799929c1e3
|
joining entities using T x R x S method with groupByKey
|
2020-01-21 16:35:44 +01:00 |
Claudio Atzori
|
97c239ee0d
|
WIP: trying to find a way to build the records for the index
|
2020-01-16 12:02:28 +02:00 |
Claudio Atzori
|
7ba586d2e5
|
oozie workflow aimed to build the adjacency lists representation of the graph, needed to build the records to be indexed
|
2019-12-17 16:24:49 +01:00 |