Claudio Atzori
|
3a11a387a9
|
data provision workflow enhancement: added nodes to perform DELETE BY QUERY before the indexing begins and COMMIT after the indexing is completed
|
2020-08-03 14:28:08 +02:00 |
Claudio Atzori
|
cc5d13da85
|
introduced parameter shouldIndex (true|false)
|
2020-07-16 13:46:39 +02:00 |
Claudio Atzori
|
b098cc3cbe
|
avoid repeating identical values for fields: source, description
|
2020-07-16 13:45:53 +02:00 |
Claudio Atzori
|
7d6e269b40
|
reverted CreateRelatedEntitiesJob_phase1 to its previous state
|
2020-07-13 22:54:04 +02:00 |
Claudio Atzori
|
8e97598eb4
|
avoid NPE in case of null instances
|
2020-07-13 20:46:14 +02:00 |
Claudio Atzori
|
06c1913062
|
added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations
|
2020-07-10 19:03:33 +02:00 |
Claudio Atzori
|
4c3836f62e
|
materialize the related entities before joining them
|
2020-07-10 19:00:44 +02:00 |
Claudio Atzori
|
b21866a2da
|
allow setting different relation cut points by source and by target; adjusted weight assigned to relationship types
|
2020-07-10 13:59:48 +02:00 |
Claudio Atzori
|
ff4d6214f1
|
experimenting with pruning of relations
|
2020-07-10 10:06:41 +02:00 |
Claudio Atzori
|
b383ed42fa
|
pass optional parameter relationFilter to the PrepareRelationJob implementation
|
2020-07-07 14:21:28 +02:00 |
Claudio Atzori
|
d380b85246
|
unit test for the preparation of the relations
|
2020-07-02 12:42:13 +02:00 |
Claudio Atzori
|
7817338e05
|
added test to verify the relation pre-processing
|
2020-06-26 17:58:33 +02:00 |
Claudio Atzori
|
8d59fdf34e
|
WIP: dataset based PrepareRelationsJob
|
2020-06-26 14:32:58 +02:00 |
Claudio Atzori
|
216975c4ec
|
restored complete provision workflow
|
2020-06-25 12:55:52 +02:00 |
Claudio Atzori
|
93f627ea51
|
code formatting
|
2020-06-25 12:54:21 +02:00 |
Claudio Atzori
|
e62333192c
|
WIP: prepare relation job
|
2020-06-25 12:22:18 +02:00 |
Claudio Atzori
|
6933ec11fb
|
WIP: prepare relation job
|
2020-06-25 11:04:12 +02:00 |
Sandro La Bruzzo
|
a6c0faac70
|
added test to verify secondary sorting
|
2020-06-25 10:48:15 +02:00 |
Claudio Atzori
|
69b0391708
|
WIP: prepare relation job
|
2020-06-25 10:19:56 +02:00 |
Claudio Atzori
|
46e76affeb
|
WIP: prepare relation job
|
2020-06-24 19:01:15 +02:00 |
Claudio Atzori
|
0e723d378b
|
added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job
|
2020-06-24 18:34:42 +02:00 |
Claudio Atzori
|
463489f59f
|
code formatting
|
2020-06-12 12:03:25 +02:00 |
Claudio Atzori
|
4bcad1c9c3
|
Merge branch 'graph_cleaning'
|
2020-06-12 11:40:25 +02:00 |
Alessia Bardi
|
e79943965b
|
Fixes #5604: field oamandatepublications in XML
|
2020-06-11 12:49:31 +02:00 |
Claudio Atzori
|
67c7b31ba6
|
Merge branch 'master' into graph_cleaning
|
2020-06-10 15:00:35 +02:00 |
Claudio Atzori
|
ce12f236bb
|
disabled test, need to update the joined_entity.json file
|
2020-06-09 20:07:36 +02:00 |
Claudio Atzori
|
a2fdf85ba1
|
WIP: graph cleaner implementation
|
2020-06-09 19:52:53 +02:00 |
Claudio Atzori
|
05f269a1c0
|
kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob
|
2020-06-01 00:32:42 +02:00 |
Claudio Atzori
|
6f5f498c78
|
restored common properties driving executor-cores and executor-memory in join_organization_relations wf node
|
2020-05-29 11:22:00 +02:00 |
Claudio Atzori
|
b2f9564f13
|
WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob
|
2020-05-29 10:58:15 +02:00 |
Claudio Atzori
|
a57965a3ea
|
limiting the dimensions of outliers
|
2020-05-28 17:36:37 +02:00 |
Claudio Atzori
|
821be1f8b6
|
experimental implementation of custom aggregation using kryo encoders
|
2020-05-28 13:53:13 +02:00 |
Claudio Atzori
|
83504ecace
|
limiting the maximum number of authors allowed in XML records to MAX_AUTHORS = 200; authors with ORCID can exceed that limit
|
2020-05-28 13:52:30 +02:00 |
Claudio Atzori
|
ef11593068
|
JoinedEntity.links defined as empty list by default
|
2020-05-28 13:50:44 +02:00 |
Claudio Atzori
|
5dea155a87
|
increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase
|
2020-05-28 13:49:59 +02:00 |
Claudio Atzori
|
fdd54bad1c
|
code formatting
|
2020-05-27 19:31:54 +02:00 |
Claudio Atzori
|
cfd753217c
|
repartition the join_entities into 24k files
|
2020-05-27 12:44:01 +02:00 |
Claudio Atzori
|
2f1a623d09
|
sync from master branch
|
2020-05-27 12:39:58 +02:00 |
Claudio Atzori
|
9e4ec1543b
|
updated test
|
2020-05-27 12:38:42 +02:00 |
Claudio Atzori
|
8047d16dd9
|
added RDD based adjacency list creation procedure
|
2020-05-27 12:38:12 +02:00 |
Claudio Atzori
|
f057dcdf65
|
limit the max number of externalreferences to MAX_EXTERNAL_ENTITIES
|
2020-05-27 12:37:33 +02:00 |
Claudio Atzori
|
4e36d689dd
|
fixed XML serialization for children sub-elements (duplicates & externalreferences)
|
2020-05-26 18:30:40 +02:00 |
Claudio Atzori
|
b8e541a454
|
fixing repeated organization.websiteurl in organization entities (#5645) as well as project.ecinternationalorganizationeurinterests
|
2020-05-26 10:30:09 +02:00 |
Claudio Atzori
|
925d933204
|
making XmlRecordFactory immune to graph encoding changes (mostly to avoid NPEs)
|
2020-05-22 08:50:44 +02:00 |
Claudio Atzori
|
b33dd58be4
|
replaced parameter 'reuseRecords' with 'resumeFrom', allowing to restart the provision workflow execution from any step, useful for manual submissions or debugging
|
2020-05-22 08:50:06 +02:00 |
Claudio Atzori
|
dbfb9c19fe
|
minor changes
|
2020-05-21 10:00:14 +02:00 |
Claudio Atzori
|
d7d2a0637f
|
added extra parameters to the provision indexing workflow
|
2020-05-20 14:55:38 +02:00 |
Claudio Atzori
|
0bdfbb0a57
|
reintroduced RDD based relation cut off procedure
|
2020-05-19 15:02:21 +02:00 |
Claudio Atzori
|
8c67073a07
|
force speculative execution to false
|
2020-05-08 09:42:21 +02:00 |
Claudio Atzori
|
bac37b3973
|
fixed children expansion in XML records
|
2020-05-04 11:51:17 +02:00 |