Alessia Bardi
|
09fc7e2f78
|
serialization of validated flag on relationships
|
2021-02-10 11:22:09 +01:00 |
Claudio Atzori
|
0374d34c3e
|
introduced configuration param outputFormat: HDFS | SOLR
|
2020-11-19 10:34:28 +01:00 |
Claudio Atzori
|
d9e07a242b
|
extended XmlIndexingJob to accept an optional parameter: outputPath. When present, it forces the job to write its output to the specified HDFS location
|
2020-11-18 14:34:55 +01:00 |
Claudio Atzori
|
1871d1c6f6
|
solve error java.lang.NoSuchFieldError: INSTANCE when instantiating Solr client
|
2020-08-14 11:18:30 +02:00 |
Claudio Atzori
|
3a11a387a9
|
data provision workflow enhancement: added nodes to perform DELETE BY QUERY before the indexing begins and COMMIT after the indexing is completed
|
2020-08-03 14:28:08 +02:00 |
Claudio Atzori
|
cc5d13da85
|
introduced parameter shouldIndex (true|false)
|
2020-07-16 13:46:39 +02:00 |
Claudio Atzori
|
06c1913062
|
added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations
|
2020-07-10 19:03:33 +02:00 |
Claudio Atzori
|
b21866a2da
|
allow setting different relation cut points by source and by target; adjusted weight assigned to relationship types
|
2020-07-10 13:59:48 +02:00 |
Claudio Atzori
|
b383ed42fa
|
pass optional parameter relationFilter to the PrepareRelationJob implementation
|
2020-07-07 14:21:28 +02:00 |
Claudio Atzori
|
7817338e05
|
added test to verify the relation pre-processing
|
2020-06-26 17:58:33 +02:00 |
Claudio Atzori
|
216975c4ec
|
restored complete provision workflow
|
2020-06-25 12:55:52 +02:00 |
Claudio Atzori
|
0e723d378b
|
added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job
|
2020-06-24 18:34:42 +02:00 |
Claudio Atzori
|
05f269a1c0
|
kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob
|
2020-06-01 00:32:42 +02:00 |
Claudio Atzori
|
6f5f498c78
|
restored common properties driving executor-cores and executor-memory in join_organization_relations wf node
|
2020-05-29 11:22:00 +02:00 |
Claudio Atzori
|
b2f9564f13
|
WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob
|
2020-05-29 10:58:15 +02:00 |
Claudio Atzori
|
5dea155a87
|
increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase
|
2020-05-28 13:49:59 +02:00 |
Claudio Atzori
|
cfd753217c
|
repartition the join_entities in 24k files
|
2020-05-27 12:44:01 +02:00 |
Claudio Atzori
|
4e36d689dd
|
fixed XML serialization for children sub-elements (duplicates & externalreferences)
|
2020-05-26 18:30:40 +02:00 |
Claudio Atzori
|
b33dd58be4
|
replaced parameter 'reuseRecords' with 'resumeFrom', which allows restarting the provision workflow execution from any step; useful for manual submissions or debugging
|
2020-05-22 08:50:06 +02:00 |
Claudio Atzori
|
dbfb9c19fe
|
minor changes
|
2020-05-21 10:00:14 +02:00 |
Claudio Atzori
|
d7d2a0637f
|
added extra parameters to the provision indexing workflow
|
2020-05-20 14:55:38 +02:00 |
Claudio Atzori
|
0bdfbb0a57
|
reintroduced RDD based relation cut off procedure
|
2020-05-19 15:02:21 +02:00 |
Claudio Atzori
|
8c67073a07
|
force speculative execution to false
|
2020-05-08 09:42:21 +02:00 |
Claudio Atzori
|
bac37b3973
|
fixed children expansion in XML records
|
2020-05-04 11:51:17 +02:00 |
Claudio Atzori
|
0b55795d4d
|
small adjustments in the provisioning workflow
|
2020-04-21 16:15:04 +02:00 |
Claudio Atzori
|
77f59b1b10
|
dataset based provision WIP
|
2020-04-06 19:37:27 +02:00 |
Claudio Atzori
|
ca345aaad3
|
dataset based provision WIP
|
2020-04-06 15:33:31 +02:00 |
Claudio Atzori
|
c8f4b95464
|
dataset based provision WIP
|
2020-04-06 08:59:58 +02:00 |
Claudio Atzori
|
eb2f5f3198
|
dataset based provision WIP
|
2020-04-04 17:41:31 +02:00 |
Claudio Atzori
|
3d1b637cab
|
dataset based provision WIP
|
2020-04-04 14:03:43 +02:00 |
Claudio Atzori
|
24b2c9012e
|
dataset based provision WIP
|
2020-04-02 18:44:09 +02:00 |
Claudio Atzori
|
daa26acc9d
|
dataset based provision WIP, fixed spark2EventLogDir
|
2020-04-02 16:15:50 +02:00 |
Claudio Atzori
|
9c7092416a
|
dataset based provision WIP
|
2020-04-01 19:07:30 +02:00 |
Claudio Atzori
|
adcdd2d05e
|
WIP: reimplementing the adjacency list construction process using spark Datasets
|
2020-04-01 14:56:57 +02:00 |
Claudio Atzori
|
0fbec69b82
|
use oozie prepare statement to cleanup working directories
|
2020-03-30 19:48:41 +02:00 |
Claudio Atzori
|
f3f9affd49
|
allow dynamic executors to build XML records
|
2020-03-30 13:12:11 +02:00 |
Claudio Atzori
|
673e744649
|
moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa
|
2020-03-27 10:42:17 +01:00 |
Claudio Atzori
|
abe8fb69a2
|
added global properties, moved postprocessing script inside the oozie_app directory
|
2020-03-18 15:43:54 +01:00 |
Claudio Atzori
|
1e563bc15e
|
introduced distinct properties driving the resource usage for the XML record creation and the indexing phase
|
2020-03-04 10:55:11 +01:00 |
Claudio Atzori
|
bc7cfd5975
|
indexing workflow WIP: fixed projects fundingtree xml conversion, prioritized links between results and projects when limiting them to 100 in the join procedure
|
2020-03-02 17:03:07 +01:00 |
Claudio Atzori
|
56d1810a66
|
working procedure for records indexing using Spark, via lib com.lucidworks.spark:spark-solr
|
2020-02-14 12:28:52 +01:00 |
Claudio Atzori
|
1fee6e2b7e
|
implemented XML records construction and serialization, indexing WIP
|
2020-02-13 16:53:27 +01:00 |
Claudio Atzori
|
49ef2f4eb1
|
removed input parameter specification, SparkXmlRecordBuilderJob doesn't need hive
|
2020-01-30 18:20:26 +01:00 |
Claudio Atzori
|
b2691a3b0a
|
save adjacency list as JoinedEntity
|
2020-01-30 17:46:29 +01:00 |
Claudio Atzori
|
799929c1e3
|
joining entities using T x R x S method with groupByKey
|
2020-01-21 16:35:44 +01:00 |
Claudio Atzori
|
97c239ee0d
|
WIP: trying to find a way to build the records for the index
|
2020-01-16 12:02:28 +02:00 |
Claudio Atzori
|
7ba586d2e5
|
oozie workflow that builds the adjacency list representation of the graph, needed to build the records to be indexed
|
2019-12-17 16:24:49 +01:00 |