Commit Graph

1372 Commits

Author SHA1 Message Date
Sandro La Bruzzo 13815d5d13 improvement DOIBoost 2020-06-01 17:52:12 +02:00
Claudio Atzori 05f269a1c0 kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-06-01 00:32:42 +02:00
Claudio Atzori 5e23fb3a74 code formatting 2020-05-30 10:52:56 +02:00
Claudio Atzori 54ca8ed6c3 uniformed param name (isLookupUrl), Vocab model classes defined as Serializable 2020-05-29 18:17:30 +02:00
Claudio Atzori 1577bd5b8b added IsLookupUrl to the raw_db workflow parameters 2020-05-29 16:18:16 +02:00
Claudio Atzori 91d78b825b Merge pull request 'import from db using is vocabularies' (#17) from result_pids into master. Looks good, thanks Michele! 2020-05-29 16:02:40 +02:00
Michele Artini adb798faa5 import from db using is vocabularies 2020-05-29 12:03:51 +02:00
Claudio Atzori 6f5f498c78 restored common properties driving executor-cores and executor-memory in join_organization_relations wf node 2020-05-29 11:22:00 +02:00
Claudio Atzori b2f9564f13 WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-05-29 10:58:15 +02:00
Miriam Baglioni dfa4997a4f removed commented code 2020-05-29 10:45:18 +02:00
Miriam Baglioni 6f1eea28b6 changed message in log 2020-05-29 10:41:39 +02:00
Sandro La Bruzzo b87b3ddb6b changed mapping ORCIDToOAF 2020-05-29 09:32:04 +02:00
Miriam Baglioni 8b6e886fb6 added new resource for testing 2020-05-28 23:54:31 +02:00
Miriam Baglioni 6989fb9c8a changed the project test according to the newly introduced join with the db project codes 2020-05-28 23:53:24 +02:00
Miriam Baglioni 782984d8e5 added needed parameter 2020-05-28 23:52:41 +02:00
Miriam Baglioni 01f7876595 fix issue with flatMap - the return type must not be null 2020-05-28 23:50:32 +02:00
Claudio Atzori c8d5d80210 Merge branch 'dhp_oaf_model' of https://code-repo.d4science.org/D-Net/dnet-hadoop into dhp_oaf_model 2020-05-28 18:00:24 +02:00
Claudio Atzori a57965a3ea limiting the dimensions of outliers 2020-05-28 17:36:37 +02:00
Miriam Baglioni 773735f870 added the path to the file containing the projects code from the db 2020-05-28 17:30:45 +02:00
Miriam Baglioni 6a15067a64 added one step in the workflow 2020-05-28 17:30:09 +02:00
Miriam Baglioni 5309a99a70 modified the PrepareProjects to consider those in the db 2020-05-28 17:29:53 +02:00
Miriam Baglioni b737ed8236 added part to read projects from the openaire db to filter out those in the csv file that are not in the db 2020-05-28 17:29:21 +02:00
Claudio Atzori 821be1f8b6 experimental implementation of custom aggregation using kryo encoders 2020-05-28 13:53:13 +02:00
Claudio Atzori 83504ecace limiting the maximum number of authors allowed in XML records to MAX_AUTHORS = 200; authors with ORCID can exceed that limit 2020-05-28 13:52:30 +02:00
Claudio Atzori ef11593068 JoinedEntity.links defined as empty list by default 2020-05-28 13:50:44 +02:00
Claudio Atzori 5dea155a87 increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase 2020-05-28 13:49:59 +02:00
Miriam Baglioni 35b7279147 changed test because data are saved as SequenceFile now, and because of the group by the number of produced updates decreases 2020-05-28 10:26:12 +02:00
Miriam Baglioni 37c155b86a merge branch with fork master 2020-05-28 10:09:51 +02:00
Miriam Baglioni df44db686a refactoring 2020-05-28 10:07:00 +02:00
Miriam Baglioni 87b07f4af8 removed unused variables 2020-05-28 10:05:43 +02:00
Miriam Baglioni 1060977272 added fs actions to remove and then create the workingDir 2020-05-28 10:04:36 +02:00
Miriam Baglioni 96d1a3c431 deleted the file where to store the csv files 2020-05-28 10:04:10 +02:00
Miriam Baglioni 669c05c771 added groupBy before creating Actions 2020-05-28 10:00:45 +02:00
Sandro La Bruzzo 02f90eeb07 Merge remote-tracking branch 'origin/master' into doiboost 2020-05-28 09:58:32 +02:00
Sandro La Bruzzo 7d29b61c62 code refactor 2020-05-28 09:57:46 +02:00
Claudio Atzori fdd54bad1c code formatting 2020-05-27 19:31:54 +02:00
Miriam Baglioni 1855453434 changed the outputdir of the last step 2020-05-27 17:59:36 +02:00
Miriam Baglioni dd1e0b93b8 added merge for Programme 2020-05-27 17:40:32 +02:00
Miriam Baglioni f3dcca0dd0 added equals for programme 2020-05-27 17:23:34 +02:00
Claudio Atzori b9b1bc9967 Merge branch 'master' into provision_indexing 2020-05-27 12:55:20 +02:00
Claudio Atzori aac1515b58 Merge pull request 'result_pids without conflicts ???' (#16) from result_pids into master. Looks good, thanks Michele 2020-05-27 12:54:52 +02:00
Michele Artini f5ce7d76e1 resolve conflicts 2020-05-27 12:49:17 +02:00
Claudio Atzori cfd753217c repartition the join_entities in 24k files 2020-05-27 12:44:01 +02:00
Claudio Atzori 2f1a623d09 sync from master branch 2020-05-27 12:39:58 +02:00
Claudio Atzori 9e4ec1543b updated test 2020-05-27 12:38:42 +02:00
Claudio Atzori 8047d16dd9 added RDD based adjacency list creation procedure 2020-05-27 12:38:12 +02:00
Claudio Atzori f057dcdf65 limit the max number of externalreferences to MAX_EXTERNAL_ENTITIES 2020-05-27 12:37:33 +02:00
Michele Artini b81f2741d2 xquery 2020-05-27 12:10:20 +02:00
Michele Artini a25598140a result pids (new xpaths + IS vocabularies) 2020-05-27 12:10:20 +02:00
Michele Artini 7a7272d9ec result pids (new xpaths + IS vocabularies) 2020-05-27 12:10:20 +02:00