Commit Graph

1648 Commits

Author SHA1 Message Date
Alessia Bardi 16cb073b15 set the instance dateofacceptance with the Crossref createdDate in case the issuedDate is blank 2020-06-08 19:06:03 +02:00
Michele Artini bb659d870c join simrels 2020-06-08 16:29:01 +02:00
Michele Artini 81e85465d8 join simrels 2020-06-08 16:26:16 +02:00
Claudio Atzori 3d871c6651 Merge branch 'master' into graph_cleaning 2020-06-08 15:23:24 +02:00
Claudio Atzori 25a093b1a4 integrated changes from master 2020-06-08 15:04:00 +02:00
Sandro La Bruzzo e34e7d6728 merge DOIBoost 2020-06-08 08:32:22 +02:00
Sandro La Bruzzo e46e2a4776 Merge remote-tracking branch 'origin/master' into doiboost 2020-06-08 08:17:14 +02:00
Spyros Zoupanos 3576dd186b Adding hive timeout as workflow parameter 2020-06-05 22:29:54 +03:00
Claudio Atzori b2349659cf WIP: graph property fixing implementation 2020-06-05 18:37:38 +02:00
Michele Artini a73973a74b partial implementation of broker events generation 2020-06-05 11:43:00 +02:00
Michele Artini 7e82996e7c partial implementation of broker events generation 2020-06-04 17:10:43 +02:00
Sandro La Bruzzo b57e8ba374 Merge remote-tracking branch 'origin/master' into doiboost 2020-06-04 14:39:41 +02:00
Sandro La Bruzzo 7ac1ba2e35 improvement DOIBoost 2020-06-04 14:39:20 +02:00
Michele Artini 97177d7f7b partial refactoring 2020-06-04 10:26:34 +02:00
Sandro La Bruzzo 13815d5d13 improvement DOIBoost 2020-06-01 17:52:12 +02:00
Claudio Atzori 05f269a1c0 kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-06-01 00:32:42 +02:00
Claudio Atzori 5e23fb3a74 code formatting 2020-05-30 10:52:56 +02:00
Claudio Atzori 54ca8ed6c3 uniformed param name (isLookupUrl), Vocab model classes defined as Serializable 2020-05-29 18:17:30 +02:00
Claudio Atzori 1577bd5b8b added IsLookupUrl to the raw_db workflow parameters 2020-05-29 16:18:16 +02:00
Claudio Atzori 91d78b825b Merge pull request 'import from db using is vocabularies' (#17) from result_pids into master (Looks good, thanks Michele!) 2020-05-29 16:02:40 +02:00
Michele Artini adb798faa5 import from db using is vocabularies 2020-05-29 12:03:51 +02:00
Claudio Atzori 6f5f498c78 restored common properties driving executor-cores and executor-memory in join_organization_relations wf node 2020-05-29 11:22:00 +02:00
Claudio Atzori b2f9564f13 WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-05-29 10:58:15 +02:00
Miriam Baglioni dfa4997a4f removed commented code 2020-05-29 10:45:18 +02:00
Miriam Baglioni 6f1eea28b6 changed message in log 2020-05-29 10:41:39 +02:00
Sandro La Bruzzo b87b3ddb6b changed mapping ORCIDToOAF 2020-05-29 09:32:04 +02:00
Miriam Baglioni 8b6e886fb6 added new resource for testing 2020-05-28 23:54:31 +02:00
Miriam Baglioni 6989fb9c8a changed the project test according to the newly introduced join with the db project codes 2020-05-28 23:53:24 +02:00
Miriam Baglioni 782984d8e5 added needed parameter 2020-05-28 23:52:41 +02:00
Miriam Baglioni 01f7876595 fix issue with flatMap - the return type must not be null 2020-05-28 23:50:32 +02:00
Claudio Atzori a57965a3ea limiting the dimensions of outliers 2020-05-28 17:36:37 +02:00
Miriam Baglioni 773735f870 added the path to the file containing the projects code from the db 2020-05-28 17:30:45 +02:00
Miriam Baglioni 6a15067a64 added one step in the workflow 2020-05-28 17:30:09 +02:00
Miriam Baglioni 5309a99a70 modified the PrepareProjects to consider those in the db 2020-05-28 17:29:53 +02:00
Miriam Baglioni b737ed8236 added part to read projects from the openaire db to filter out those in the csv file that are not in the db 2020-05-28 17:29:21 +02:00
Claudio Atzori 821be1f8b6 experimental implementation of custom aggregation using kryo encoders 2020-05-28 13:53:13 +02:00
Claudio Atzori 83504ecace limiting the maximum number of authors allowed in XML records to MAX_AUTHORS = 200; authors with ORCID can exceed that limit 2020-05-28 13:52:30 +02:00
Claudio Atzori ef11593068 JoinedEntity.links defined as empty list by default 2020-05-28 13:50:44 +02:00
Claudio Atzori 5dea155a87 increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase 2020-05-28 13:49:59 +02:00
Miriam Baglioni 35b7279147 changed test because data are saved as SequenceFile now, and because of the group by the number of produced updates decreases 2020-05-28 10:26:12 +02:00
Miriam Baglioni 37c155b86a merge branch with fork master 2020-05-28 10:09:51 +02:00
Miriam Baglioni df44db686a refactoring 2020-05-28 10:07:00 +02:00
Miriam Baglioni 87b07f4af8 removed unused variables 2020-05-28 10:05:43 +02:00
Miriam Baglioni 1060977272 added fs actions to remove and then create the workingDir 2020-05-28 10:04:36 +02:00
Miriam Baglioni 96d1a3c431 deleted the file where to store the csv files 2020-05-28 10:04:10 +02:00
Miriam Baglioni 669c05c771 added groupBy before creating Actions 2020-05-28 10:00:45 +02:00
Sandro La Bruzzo 02f90eeb07 Merge remote-tracking branch 'origin/master' into doiboost 2020-05-28 09:58:32 +02:00
Sandro La Bruzzo 7d29b61c62 code refactor 2020-05-28 09:57:46 +02:00
Claudio Atzori fdd54bad1c code formatting 2020-05-27 19:31:54 +02:00
Miriam Baglioni 1855453434 changed the outputdir of the last step 2020-05-27 17:59:36 +02:00
Claudio Atzori b9b1bc9967 Merge branch 'master' into provision_indexing 2020-05-27 12:55:20 +02:00
Claudio Atzori aac1515b58 Merge pull request 'result_pids without conflicts ???' (#16) from result_pids into master (Looks good, thanks Michele) 2020-05-27 12:54:52 +02:00
Michele Artini f5ce7d76e1 resolve conflicts 2020-05-27 12:49:17 +02:00
Claudio Atzori cfd753217c repartition the join_entities in 24k files 2020-05-27 12:44:01 +02:00
Claudio Atzori 2f1a623d09 sync from master branch 2020-05-27 12:39:58 +02:00
Claudio Atzori 9e4ec1543b updated test 2020-05-27 12:38:42 +02:00
Claudio Atzori 8047d16dd9 added RDD based adjacency list creation procedure 2020-05-27 12:38:12 +02:00
Claudio Atzori f057dcdf65 limit the max number of externalreferences to MAX_EXTERNAL_ENTITIES 2020-05-27 12:37:33 +02:00
Michele Artini b81f2741d2 xquery 2020-05-27 12:10:20 +02:00
Michele Artini a25598140a result pids (new xpaths + IS vocabularies) 2020-05-27 12:10:20 +02:00
Michele Artini 7a7272d9ec result pids (new xpaths + IS vocabularies) 2020-05-27 12:10:20 +02:00
Michele Artini 3ceb2d2853 match terms with vocabularies 2020-05-27 11:34:13 +02:00
Claudio Atzori 4e36d689dd fixed XML serialization for children sub-elements (duplicates & externalreferences) 2020-05-26 18:30:40 +02:00
Miriam Baglioni 92e3a52e91 merge branch with fork master 2020-05-26 15:57:51 +02:00
Michele Artini c15d997925 xquery 2020-05-26 13:13:17 +02:00
Michele Artini c6af36496a result pids (new xpaths + IS vocabularies) 2020-05-26 13:11:09 +02:00
Michele Artini 093f1aff03 result pids (new xpaths + IS vocabularies) 2020-05-26 13:06:55 +02:00
Claudio Atzori b8e541a454 fixing repeated organization.websiteurl in organization entities (#5645) as well as project.ecinternationalorganizationeurinterests 2020-05-26 10:30:09 +02:00
Claudio Atzori 55595d7235 HACK: patch NULL values with defaults found in result.datainfo.deletedbyinference and result.context 2020-05-26 10:28:35 +02:00
Claudio Atzori 7b288a94cb code formatting 2020-05-26 09:54:13 +02:00
Miriam Baglioni 54d869e618 merge upstream 2020-05-26 09:22:04 +02:00
Miriam Baglioni eea07f4c42 refactoring 2020-05-26 09:21:49 +02:00
Sandro La Bruzzo 79c26382da Merge remote-tracking branch 'origin/master' into doiboost 2020-05-26 09:15:50 +02:00
Sandro La Bruzzo 25f52e19a4 implemented generation of ActionSet 2020-05-26 09:15:33 +02:00
Michele Artini d6aada4957 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-26 08:44:31 +02:00
Michele Artini b1546605e3 updated version of a dependency 2020-05-26 08:44:15 +02:00
Claudio Atzori 7582532e73 [maven-release-plugin] prepare for next development iteration 2020-05-25 19:48:18 +02:00
Claudio Atzori 01c2e93395 [maven-release-plugin] prepare release dhp-1.2.1 2020-05-25 19:48:14 +02:00
miconis da1e5cf557 implementation of the result title merge. main title with higher trust, distinct between the others 2020-05-25 18:02:57 +02:00
Miriam Baglioni d3d36647d2 merge upstream 2020-05-25 10:38:22 +02:00
Miriam Baglioni 74215f6d9f refactoring 2020-05-25 10:38:16 +02:00
Miriam Baglioni dbde2d243a changed due to move of PacePerson from dhp-graph-mapper to dhp-common 2020-05-25 10:35:39 +02:00
Miriam Baglioni f754c424bd changed logic to compute only once PacePerson for each Author to be enriched 2020-05-25 10:35:02 +02:00
Miriam Baglioni 8f51af4e9b added PacePerson to get name surname for authors having only fullname set 2020-05-25 10:34:30 +02:00
Miriam Baglioni b258f99ece fix for issue that duplicated result 2020-05-25 10:26:48 +02:00
Miriam Baglioni 8f6ce970f9 moved PacePerson to dhp-common to avoid conflict in dependency with graph-mapper 2020-05-25 10:25:55 +02:00
Claudio Atzori de108f54d6 code formatting 2020-05-23 10:21:19 +02:00
Claudio Atzori 6b56cae57d added mapping for bestaccessrights 2020-05-23 09:57:39 +02:00
Claudio Atzori 7181807e64 code formatting 2020-05-23 09:51:48 +02:00
Sandro La Bruzzo 2408083566 implemented filtering step 2020-05-23 08:46:49 +02:00
Sandro La Bruzzo 244f6e50cf Merge remote-tracking branch 'origin/master' into doiboost 2020-05-22 20:52:15 +02:00
Sandro La Bruzzo 147dd389bf minor fix 2020-05-22 20:51:42 +02:00
Miriam Baglioni 0d1ec1913f added fix to avoid duplication of results 2020-05-22 18:42:25 +02:00
miconis 5d7ac78c41 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-22 17:25:08 +02:00
miconis 0fd0c7d725 reimplementation of the sim between two authors. now it takes into account both name and surname. threshold incremented to 1.0 if the name is too short 2020-05-22 17:24:57 +02:00
Michele Artini eb606dc1e2 partial implementation of events with rels 2020-05-22 17:17:41 +02:00
Miriam Baglioni 29066a6b46 applied code cleanup 2020-05-22 15:38:50 +02:00
Miriam Baglioni 8610ad5142 added groupby id to fix multiple result with same id at join step 2020-05-22 15:32:55 +02:00
Miriam Baglioni 1e44703e3e merge upstream 2020-05-22 15:30:07 +02:00
Miriam Baglioni ac8025f469 - 2020-05-22 15:29:41 +02:00
Miriam Baglioni 50ad83b97f - 2020-05-22 15:27:19 +02:00
Miriam Baglioni 473c6d3a23 produces AtomicActions instead of Projects 2020-05-22 15:26:57 +02:00
Sandro La Bruzzo 72278b9375 Merge remote-tracking branch 'origin/master' into doiboost 2020-05-22 15:17:13 +02:00
Sandro La Bruzzo 22936d0877 Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost 2020-05-22 15:15:17 +02:00
Sandro La Bruzzo 9fbb221457 completed mapping of UnpayWall and ORCID 2020-05-22 15:15:09 +02:00
Miriam Baglioni 70389b0a30 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-22 13:53:23 +02:00
Miriam Baglioni 4308f31165 added fix to make test run 2020-05-22 13:13:01 +02:00
Claudio Atzori 946598cfba Merge branch 'master' into provision_indexing 2020-05-22 12:35:41 +02:00
Claudio Atzori 3cf2796ac6 code formatting 2020-05-22 12:34:00 +02:00
Michele Artini dc4621b3cb filter ORCID and MAG identifiers 2020-05-22 12:25:01 +02:00
Michele Artini 9f2d0f1b08 filter ORCID and MAG identifiers 2020-05-22 11:00:27 +02:00
Michele Artini 9de71e54a8 filter ORCID and MAG identifiers 2020-05-22 10:47:39 +02:00
Michele Artini c5f7e17348 author fullnames 2020-05-22 10:08:02 +02:00
Claudio Atzori ad40470040 Merge branch 'master' into provision_indexing 2020-05-22 08:51:22 +02:00
Claudio Atzori 925d933204 making XmlRecordFactory immune to graph encoding changes (mostly to avoid NPEs) 2020-05-22 08:50:44 +02:00
Claudio Atzori b33dd58be4 replaced parameter 'reuseRecords' with 'resumeFrom', allowing to restart the provision workflow execution from any step, useful for manual submissions or debugging 2020-05-22 08:50:06 +02:00
Michele Artini c7ca3cf35b Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-21 16:48:20 +02:00
Michele Artini 3e34517479 partial implementation of events with rels 2020-05-21 16:47:53 +02:00
Miriam Baglioni eae12a6586 Merge branch 'master' into dhp_oaf_model 2020-05-21 16:31:22 +02:00
Miriam Baglioni 6750075fbd merge upstream 2020-05-21 16:31:09 +02:00
Miriam Baglioni 4589c428b1 generate action sets and saves them in the hdfs path for the actions sets 2020-05-21 16:30:39 +02:00
miconis 8b35e0e7f0 reimplementation of the author merging in deduprecord creation. implementation of the test class. minor changes 2020-05-21 12:02:44 +02:00
miconis 8bbd1d0501 reimplementation of the author merging in deduprecord creation. implementation of the test class. 2020-05-21 11:52:14 +02:00
Michele Artini e43d4d7778 added a coalesce in sql query 2020-05-21 11:08:07 +02:00
Claudio Atzori dbfb9c19fe minor changes 2020-05-21 10:00:14 +02:00
Michele Artini b3bcbb3129 resolve name of organization countries 2020-05-21 08:41:32 +02:00
Enrico Ottonello 1109d3b3fc Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost 2020-05-21 00:41:27 +02:00
Enrico Ottonello 869a53040e save to text file format 2020-05-21 00:41:21 +02:00
Sandro La Bruzzo 5818abaab4 fixed Crossref Mapping 2020-05-20 17:05:46 +02:00
Claudio Atzori da4267d0fe Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-05-20 14:58:22 +02:00
Claudio Atzori d7d2a0637f added extra parameters to the provision indexing workflow 2020-05-20 14:55:38 +02:00
Miriam Baglioni 055eec5a77 added resource for prepare project test 2020-05-20 13:54:10 +02:00
Miriam Baglioni 9079bc1f61 - 2020-05-20 13:53:32 +02:00
Miriam Baglioni 67ba4fde57 added test for prepare projects step 2020-05-20 13:53:08 +02:00
Miriam Baglioni 5e0e554000 Merge branch 'master' into dhp_oaf_model 2020-05-20 10:57:30 +02:00
Miriam Baglioni 76f3f73caa merge upstream 2020-05-20 10:31:40 +02:00
Miriam Baglioni 3c0eb12d3e removed the not zipped files 2020-05-20 10:31:05 +02:00
Miriam Baglioni c0d9e02340 zipped test resources that are too big 2020-05-20 10:30:25 +02:00
Miriam Baglioni 5e9c9fa87c tests 2020-05-20 10:29:57 +02:00
Miriam Baglioni faed7521bf added resources for testing 2020-05-20 10:29:29 +02:00
Miriam Baglioni 75491482de added a new preparation step to replicate each project for the programme it is associated to 2020-05-20 10:28:56 +02:00
Miriam Baglioni eb0e47ba53 parameters for h2020 programme 2020-05-20 10:26:44 +02:00
Sandro La Bruzzo b771d67e9d next step of MAG conversion implemented 2020-05-20 08:14:03 +02:00
Miriam Baglioni 08218d2f3f new workflow with added steps 2020-05-19 18:44:25 +02:00
Miriam Baglioni 457293ccc0 test for the various steps of project update with programme 2020-05-19 18:43:42 +02:00
Miriam Baglioni 9447d78ef3 added preparation classes 2020-05-19 18:42:50 +02:00
Michele Artini 85ca5622d4 partial implementation of generation of simple events 2020-05-19 16:17:35 +02:00
Claudio Atzori 0bdfbb0a57 reintroduced RDD based relation cut off procedure 2020-05-19 15:02:21 +02:00
Enrico Ottonello 934ad570e0 joined summaries and activities dataset 2020-05-19 12:57:21 +02:00
Enrico Ottonello ca722d4d18 merged 2020-05-19 09:43:12 +02:00
Enrico Ottonello 7362bc3e9d workflow to generate seq(doi,AuthorList) 2020-05-19 09:34:44 +02:00
Sandro La Bruzzo 8c95b50f26 Merge remote-tracking branch 'origin/master' into doiboost 2020-05-19 09:25:04 +02:00
Sandro La Bruzzo 486e850bcc next step of MAG conversion implemented 2020-05-19 09:24:45 +02:00
Enrico Ottonello d4e9075f22 Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost 2020-05-18 19:51:36 +02:00
Enrico Ottonello fc80e8c7de added accumulator; last modified date of the record is added to saved data; lambda file is partitioned into 20 parts before starting downloading 2020-05-18 19:51:29 +02:00
Claudio Atzori f3bc8aed31 lifted memory requirements for country propagation wf 2020-05-18 15:29:10 +02:00
Miriam Baglioni b71fbb68b1 removed the removeOutputDir command from code. Relations are written in Append. The erase of the output dir meant removing all the relations computed in the previous steps 2020-05-18 13:57:20 +02:00
Miriam Baglioni 629af7cb79 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-18 13:07:36 +02:00
Miriam Baglioni f0f14caf99 removed script files for shell actions not performed 2020-05-18 13:06:16 +02:00
Miriam Baglioni 23bbac7d7c - 2020-05-18 13:05:03 +02:00
Miriam Baglioni 4f1ff7ba73 added dependency to org.apache.commons common-csv 2020-05-18 13:04:39 +02:00
Miriam Baglioni abc45f2708 added dnet-45 HttpConnector and related Classes, produced the POJO for projects and programme 2020-05-18 13:04:06 +02:00
Claudio Atzori ef9a9a9f1a remove the output path when starting 2020-05-15 22:34:19 +02:00
Enrico Ottonello 0b29bb7e3b spark job to download orcid record modified after a fixed date 2020-05-15 19:49:26 +02:00
Miriam Baglioni 5a648016ef parameters from the GetFile class 2020-05-15 18:18:50 +02:00
Miriam Baglioni 83c262a483 workflow to download the files 2020-05-15 18:18:31 +02:00
Miriam Baglioni 22cb9e0da7 simple code to get file from URL 2020-05-15 18:18:01 +02:00
Claudio Atzori 7838f2c63f init the empty list for author pids mapped from OAF 2020-05-15 17:06:01 +02:00
Claudio Atzori 82b615ab33 NPE check 2020-05-15 16:04:46 +02:00
Miriam Baglioni e26a67c3eb merge with upstream 2020-05-15 15:53:05 +02:00
Claudio Atzori 7a89507ab1 code formatting 2020-05-15 15:16:54 +02:00
Miriam Baglioni 5ec8c49ad5 removed serialization points 2020-05-15 12:49:58 +02:00
Claudio Atzori 1d35836a58 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-05-15 12:26:31 +02:00
Claudio Atzori cfc8948717 fixed mapping OdfToGraph: pick the correct element to map author pids and author affiliations; extended mapping Oaf2Graph: added support for author pids 2020-05-15 12:26:16 +02:00
Michele Artini 2a4e68a292 events recognition 2020-05-15 12:25:37 +02:00
Claudio Atzori a832658296 code formatting 2020-05-15 10:21:09 +02:00
Claudio Atzori 50d6a2ad3c added output directory removal in the blacklist spark actions; included common global properties in blacklist's workflow.xml 2020-05-15 09:53:37 +02:00
Claudio Atzori 18f46e47b9 added relations to the graph2hive import workflow 2020-05-15 09:34:48 +02:00
Claudio Atzori 9d028ffe1c cleanup 2020-05-15 09:28:55 +02:00
Claudio Atzori fd62359538 cleanup 2020-05-15 09:28:15 +02:00
Claudio Atzori eb64335a54 parallel implementation for graph Hive importer 2020-05-15 09:05:26 +02:00
Miriam Baglioni 94571c9a51 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-14 18:29:55 +02:00
Miriam Baglioni f25db01664 changed in the constant from propagationconstants to modelconstants 2020-05-14 18:29:24 +02:00
Miriam Baglioni d05630d979 removed the constants added in ModelConstants 2020-05-14 18:22:50 +02:00
Claudio Atzori f044d09315 revised mapping: more accurate mapping for name/surname from datacite format; improved mapping of null values 2020-05-14 15:07:24 +02:00
Miriam Baglioni e7eb4f377e Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-14 10:34:17 +02:00
Miriam Baglioni 8828458acf minor changes 2020-05-14 10:34:12 +02:00
Claudio Atzori ab37953332 added global properties in wf definitions to avoid repeating name-node and job-tracker in the (many) distcp actions; reintroduced output directory removal at the beginning of each spark action 2020-05-14 10:25:41 +02:00
Claudio Atzori 12bfa6702e Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-05-13 17:01:17 +02:00
Claudio Atzori 5ecacad70a fixed default resource typing in Oaf/Odf mapping 2020-05-13 17:01:11 +02:00
Enrico Ottonello 12756f9d41 multithread (4 threads) test to feed elastic search 2020-05-13 16:11:40 +02:00
Michele Artini c0265213a0 partial implementation 2020-05-13 12:00:27 +02:00
Sandro La Bruzzo a92ee0f41e Merge remote-tracking branch 'origin/master' into doiboost 2020-05-13 10:38:13 +02:00
Sandro La Bruzzo d876f47d06 next step of MAG conversion implemented 2020-05-13 10:38:04 +02:00
Claudio Atzori 1ddd33de41 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-05-13 09:04:41 +02:00
Claudio Atzori 85f3c55992 fixed node names in blacklist workflow 2020-05-13 09:04:33 +02:00
Miriam Baglioni 43f127448d changed the package name from dhp-propagation to dhp-enrichment for the preparation phase of funding propagation 2020-05-12 18:24:26 +02:00
Enrico Ottonello 08040cef80 spark action to analyze orcid lambda file 2020-05-12 16:57:43 +02:00
Claudio Atzori ec0782e582 renamed jar containing the bulktagging and propagation workflows from dhp-[bulktagging|propagation] to dhp-enrichment; adjusted xml formatting 2020-05-12 15:49:28 +02:00
Miriam Baglioni 1547ca7e15 added blacklist step to the end of the provision wf 2020-05-12 12:17:27 +02:00
Miriam Baglioni 14979f299e changed the configuration factory 2020-05-12 11:28:38 +02:00
Miriam Baglioni f8aef6161a minor modification 2020-05-12 11:28:07 +02:00
Miriam Baglioni 7387f3449a changed the route to find the verb resolver classes 2020-05-12 11:27:38 +02:00
Miriam Baglioni 7687519f00 merged conflicts with upstream branch 2020-05-12 10:03:44 +02:00
Miriam Baglioni 8ffc050b8a fixed problem in communityconfigurationfactory test 2020-05-12 10:01:09 +02:00
Claudio Atzori 527e8169a8 adjusted paths pointing to test configurations, cleanup 2020-05-11 18:17:05 +02:00
Claudio Atzori f9a62ba63b added wf nodes to copy entities to the output path 2020-05-11 18:16:39 +02:00
Miriam Baglioni ad63effb4e removed deletion of working dir 2020-05-11 17:48:22 +02:00
Claudio Atzori c6b028f2af code formatting 2020-05-11 17:38:08 +02:00
Claudio Atzori 6d0b11252e bulktagging wfs moved into common dhp-enrichment module 2020-05-11 17:32:06 +02:00
Miriam Baglioni 50659011eb refactoring 2020-05-11 16:14:26 +02:00
Miriam Baglioni e883daf87e added the outputPath parameter and the reset path to remove the outputPath directory 2020-05-11 16:10:24 +02:00
Miriam Baglioni 5ab3424c77 removed unused dependencies 2020-05-11 16:09:37 +02:00
Miriam Baglioni 6a3b081263 added the last step of blacklisting 2020-05-11 16:09:20 +02:00
Enrico Ottonello 3b1a68cbf5 elastic search feed test 2020-05-11 14:53:52 +02:00
Enrico Ottonello f53e42bda7 merged 2020-05-11 14:49:28 +02:00
Enrico Ottonello 7990894454 different date format in lambda file parsing 2020-05-11 14:41:11 +02:00
Sandro La Bruzzo 0c6774e4da updated pom version 2020-05-11 14:35:14 +02:00
Miriam Baglioni bbc9b4f329 removed unused imports 2020-05-11 14:28:55 +02:00
Miriam Baglioni 757bae53ea removed useless serialization points 2020-05-11 14:28:37 +02:00
Miriam Baglioni b35d57a1ac added resources for test 2020-05-11 14:15:30 +02:00
Miriam Baglioni e563e65335 moved check from join to method 2020-05-11 14:11:44 +02:00
Sandro La Bruzzo b90609848b Merge remote-tracking branch 'origin/master' into doiboost 2020-05-11 14:08:31 +02:00
Sandro La Bruzzo 4062eafbdb merged from branch 2020-05-11 14:08:16 +02:00
Miriam Baglioni f5d785e096 used the DbClient moved in dhp-common 2020-05-11 13:59:42 +02:00
Miriam Baglioni 112b2cb3c3 added the test class 2020-05-11 13:58:58 +02:00
Miriam Baglioni 9a7ae523c9 update to version 1.2.1-SNAPSHOT 2020-05-11 13:57:47 +02:00
Miriam Baglioni 2abb84877d Merge branch 'master' into blacklist 2020-05-11 10:37:49 +02:00
Miriam Baglioni b0f0b24263 update to version 1.2.1-SNAPSHOT 2020-05-11 10:37:31 +02:00
Miriam Baglioni a7e91e23ba update to version 1.2.1-SNAPSHOT 2020-05-11 10:34:30 +02:00
Miriam Baglioni bb59bdd60f merge upstream 2020-05-11 10:33:17 +02:00
Miriam Baglioni 5e3548add6 - 2020-05-11 10:33:08 +02:00
Miriam Baglioni dc8c8fa480 changed the version 2020-05-11 10:20:48 +02:00
Miriam Baglioni 871e079b45 merged with master 2020-05-11 10:20:00 +02:00
Claudio Atzori 60c40618d3 [maven-release-plugin] prepare for next development iteration 2020-05-11 10:17:14 +02:00
Claudio Atzori c267d958d5 [maven-release-plugin] prepare release dhp-1.2.0 2020-05-11 10:17:10 +02:00
Miriam Baglioni 622ba87ec2 changed the version 2020-05-11 10:10:36 +02:00
Miriam Baglioni 391b2399cc merge upstream 2020-05-11 10:08:51 +02:00
Claudio Atzori 42f1a2bf94 bumped project version to 1.2.0-SNAPSHOT 2020-05-11 10:05:57 +02:00
Sandro La Bruzzo 1412158a6f merged from branch 2020-05-11 09:45:50 +02:00
Miriam Baglioni 32301451ec merge upstream 2020-05-11 09:42:23 +02:00
Miriam Baglioni 7e66bc2527 fix a typo in the compression keyword and added some logging info in the spark job 2020-05-11 09:40:58 +02:00
Sandro La Bruzzo 1662f221f5 added test class 2020-05-11 09:39:11 +02:00
Sandro La Bruzzo 2b48a2c32c Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost 2020-05-11 09:38:36 +02:00
Sandro La Bruzzo 4cebca09d2 start implementing MAG mapping 2020-05-11 09:38:27 +02:00
Spyros Zoupanos ae0f535c73 Fixing hardcoded reference to main openAIRE graph db 2020-05-09 22:34:48 +03:00
Claudio Atzori fd519df616 new rels produced by dedup workflow must be unique 2020-05-08 19:00:38 +02:00
Claudio Atzori 0ccc864ad9 [maven-release-plugin] prepare for next development iteration 2020-05-08 17:01:31 +02:00
Claudio Atzori 6e47c724c6 [maven-release-plugin] prepare release dhp-1.1.7 2020-05-08 17:01:27 +02:00
Claudio Atzori 5b28bb4131 code formatting 2020-05-08 16:49:47 +02:00
Claudio Atzori 8fd1952f16 code formatting 2020-05-08 16:01:09 +02:00
miconis 3420998bb4 reltype set in mergerels 2020-05-08 15:43:30 +02:00
Enrico Ottonello b9d126dd1f formatting modified after commit 2020-05-08 14:54:37 +02:00
Enrico Ottonello 7e1c987370 Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost 2020-05-08 14:49:50 +02:00
Enrico Ottonello 9d812788e4 added job to download from orcid the records modified after a fixed date, the info are taken from last_modified.csv on hdfs 2020-05-08 14:49:39 +02:00
Miriam Baglioni 9a29ab7508 got back to the readPath we have before 2020-05-08 13:08:56 +02:00
Miriam Baglioni 28556507e7 - 2020-05-08 12:54:52 +02:00
Claudio Atzori b2192fdcdc simplified reset_outputpath nodes across the workflows, applied common xml formatting 2020-05-08 12:33:31 +02:00
Miriam Baglioni 4c94231cad merge with master fork 2020-05-08 12:25:57 +02:00
Miriam Baglioni 9b4c0d4b3a - 2020-05-08 11:51:45 +02:00
Miriam Baglioni 53952707b6 modified test because of new step of data preparation. It now expects to find ResultCountrySet serialization instead of DatasourceCountry 2020-05-08 11:49:19 +02:00
Claudio Atzori 62ea19f1d3 introduced mapping for ExternalReferences, made urls defined within an instance unique 2020-05-08 09:43:26 +02:00
Claudio Atzori 8c67073a07 force speculative execution to false 2020-05-08 09:42:21 +02:00
Miriam Baglioni d6b9de9f46 Merge branch 'master' of https://code-repo.d4science.org/miriam.baglioni/dnet-hadoop 2020-05-07 18:22:59 +02:00
Miriam Baglioni f95d288681 fixed switch of parameters 2020-05-07 18:22:32 +02:00
Claudio Atzori 166aafd936 heavy cleanup 2020-05-07 18:22:26 +02:00
Michele Artini ac0da5a7ee Partial implementation of broker events 2020-05-07 12:31:26 +02:00
Miriam Baglioni fb405275f7 merged with master 2020-05-07 11:48:21 +02:00
Miriam Baglioni e124278934 - 2020-05-07 11:47:11 +02:00
Claudio Atzori 5111671e62 cleanup 2020-05-07 11:47:00 +02:00
Miriam Baglioni 9f8855991c changed Encoders.bean to Encoders.kryo 2020-05-07 11:44:35 +02:00
Miriam Baglioni 207b899d6d merged with upstream 2020-05-07 11:43:53 +02:00
Claudio Atzori 5b3f8a0e90 using Encoders.bean instead of kryo 2020-05-07 11:41:41 +02:00
Miriam Baglioni 182225becb Merge branch 'master' of https://code-repo.d4science.org/miriam.baglioni/dnet-hadoop 2020-05-07 11:38:17 +02:00
Miriam Baglioni 5efae3acb9 new workflow for job3 2020-05-07 11:38:10 +02:00
Claudio Atzori 73243793b2 Dataset based implementation for SparkCountryPropagationJob3 2020-05-07 11:15:24 +02:00
Claudio Atzori 128c3bf1c8 restored Author bean with simple getter/setter, author pid addition moved into dedicated implementation SparkOrcidToResultFromSemRelJob3 2020-05-07 11:14:56 +02:00
Miriam Baglioni b2fec32c87 new workflow for job3 2020-05-07 10:01:57 +02:00
Miriam Baglioni 29bc8c44b1 changes in the construction of new country set 2020-05-07 10:01:34 +02:00
Miriam Baglioni 55e825acd4 changed the test according to changes in SparkCountryPropagationJob2 2020-05-07 10:01:00 +02:00
Miriam Baglioni 16193cf0ba new workflow and parameter for country propagation 2020-05-07 09:59:58 +02:00
Miriam Baglioni 5a476c7a13 changed the xquery for the cfhb table 2020-05-07 09:58:17 +02:00
Miriam Baglioni 42ad51577a new implementation with one more serialization step 2020-05-07 09:57:49 +02:00
Claudio Atzori 17860d3ab6 general changes in the RAW graph mapping: missing collectedfrom/hostedby causes records to be skipped; factored out most of the constants in ModelConstants class (dhp-schemas) 2020-05-06 13:20:02 +02:00
Claudio Atzori fdfecc9578 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-05-06 11:28:01 +02:00
Claudio Atzori c79e2f5977 drop workingPath before starting the dedup workflow 2020-05-06 11:27:44 +02:00
Michele Artini 8f30a09d84 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-05 17:12:22 +02:00
Michele Artini ccc609f909 new module for the production of broker events 2020-05-05 17:09:00 +02:00
Miriam Baglioni dd2e698a72 added a sequentialization step on the spark job. Added new parameter 2020-05-05 17:03:43 +02:00
Claudio Atzori 0825321d0b improved unit tests in dhp-aggregation 2020-05-05 12:39:04 +02:00
Miriam Baglioni 252b219dd5 changed the name of some properties 2020-05-05 10:03:32 +02:00
Claudio Atzori 4a8487165c using long param names in wf definition 2020-05-04 19:19:29 +02:00
Claudio Atzori a2fc37df5f adjusted parameters 2020-05-04 19:18:59 +02:00
Claudio Atzori f1b7e14036 code formatting 2020-05-04 19:18:34 +02:00
Miriam Baglioni 78578c3ccf fixed wrong transition name in workflow 2020-05-04 15:46:24 +02:00
Miriam Baglioni cc7d9b6b19 merge upstream 2020-05-04 13:59:09 +02:00
Miriam Baglioni 3957c815b9 changed the name of some parameters 2020-05-04 13:58:52 +02:00
Miriam Baglioni e218360f8a changed code for the mode of DbClient and also removed the dependency to graph-mapper 2020-05-04 12:26:17 +02:00
Miriam Baglioni 31ea05297d moved the DbClient to common and added needed dependency to pom 2020-05-04 12:22:28 +02:00
miconis 085cf173d7 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-04 12:08:20 +02:00
miconis 3df703f67d mergerels added to propagate relations 2020-05-04 12:08:12 +02:00
Claudio Atzori bac37b3973 fixed children expansion in XML records 2020-05-04 11:51:17 +02:00
Claudio Atzori 077ccd8743 stats wf properties cleanup 2020-05-04 11:41:46 +02:00
Miriam Baglioni b7dd400e51 added check if author.pid exists or is null 2020-05-01 15:09:02 +02:00
Miriam Baglioni dbf3ba051a minor 2020-04-30 20:22:07 +02:00
Miriam Baglioni 43053a286d workflow pom with added blacklist module 2020-04-30 18:30:21 +02:00
Miriam Baglioni 0631fe548a pom.xml 2020-04-30 18:29:46 +02:00
Miriam Baglioni 38ecfd5785 the wf with all the three steps for blacklisting relations 2020-04-30 18:28:46 +02:00
Miriam Baglioni 95433e1087 parameters for the preparation phase and blacklist phase 2020-04-30 18:28:13 +02:00
Miriam Baglioni 1070790c19 minor 2020-04-30 18:26:58 +02:00
Miriam Baglioni b9d56b3ced applies the actual removal of the relations 2020-04-30 18:26:25 +02:00
Miriam Baglioni d6d6ebeae5 preparation step: creates the subset of the merges relations 2020-04-30 18:25:33 +02:00
Miriam Baglioni 13f30664ea minor 2020-04-30 15:23:49 +02:00
Miriam Baglioni 276b95b7b3 add create file instruction 2020-04-30 15:05:17 +02:00
Miriam Baglioni 65a5d67b8b minor modifications 2020-04-30 14:45:27 +02:00
Miriam Baglioni 418595fec2 removed the saveGraph parameter 2020-04-30 14:45:00 +02:00
Miriam Baglioni ce8b1d0bc3 new workflow definition to be inserted in the provision pipeline 2020-04-30 14:38:54 +02:00
Miriam Baglioni 4b0bd91012 - 2020-04-30 12:45:28 +02:00
Miriam Baglioni 2349bfd8b8 changed the job test to remove the writeUpdate option 2020-04-30 11:43:33 +02:00
Sandro La Bruzzo 1e06bbaee8 fixed test 2020-04-30 11:38:58 +02:00
Miriam Baglioni 951517f9ec new input parameters and workflow definition to be used in the provision pipeline 2020-04-30 11:32:50 +02:00
Miriam Baglioni 026f297e49 removed the writeUpdate option 2020-04-30 11:31:59 +02:00
Sandro La Bruzzo b8e95295e2 merged from master 2020-04-30 11:27:59 +02:00
Miriam Baglioni c89fe762b1 modified relation datasource organization 2020-04-30 11:17:03 +02:00
Miriam Baglioni 3abb76ff7a merge with upstream 2020-04-30 11:15:54 +02:00
Michele Artini eb9bd42970 fixed a problem with journals 2020-04-30 11:06:05 +02:00
Miriam Baglioni 638a3c465b - 2020-04-30 11:05:17 +02:00
Michele Artini a0a6109bbc fixed a problem with journals 2020-04-30 11:03:46 +02:00
Miriam Baglioni 354f0162be changes in the blacklist and workflow definition 2020-04-30 10:26:50 +02:00
Claudio Atzori 439c6255a2 cleanup 2020-04-29 19:09:07 +02:00
Claudio Atzori 77ac995770 cleaned up poms, added descriptions 2020-04-29 18:44:17 +02:00
Miriam Baglioni 3cffee74b9 merge with upstream 2020-04-29 18:25:29 +02:00
Miriam Baglioni 9ab46535e7 pom with the new blacklist module added 2020-04-29 18:17:15 +02:00
Miriam Baglioni 6a47e6191d read from blacklist and write the result as relations on hdfs 2020-04-29 18:16:01 +02:00
Miriam Baglioni 869f576273 added hash map for relationship entityType id prefix, and relation inverse 2020-04-29 18:14:52 +02:00
Miriam Baglioni b85ad7012a reads the blacklist from the blacklist db and writes it as a set of relations on hdfs 2020-04-29 17:29:49 +02:00
Claudio Atzori 8fd81e863d added default value for the external_stats_db_name 2020-04-29 15:36:24 +02:00
Claudio Atzori c6f3ff4462 stats workflow content relocated into common package; added <global> property definitions in stats workflow.xml 2020-04-29 14:29:27 +02:00
Sandro La Bruzzo 4a89465740 reformatted code 2020-04-29 13:24:29 +02:00
Sandro La Bruzzo a6b1a59d0a merged with master 2020-04-29 13:20:57 +02:00
Sandro La Bruzzo 920c0f19c3 Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost 2020-04-29 13:13:16 +02:00
Sandro La Bruzzo 09f161f1f4 implemented unit test 2020-04-29 13:13:02 +02:00
miconis e0d14fe4f8 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-04-29 13:02:53 +02:00
miconis 0352d3b0ba entity dumps in dedup compressed 2020-04-29 13:02:34 +02:00
Michele Artini c43b4c8962 formatting 2020-04-29 12:56:58 +02:00
Michele Artini a5d7007005 Fix relations in migration. Fix pom.xml in dhp-stats-update 2020-04-29 12:05:41 +02:00
Miriam Baglioni f7695e833c resolved conflicts 2020-04-29 11:41:31 +02:00
Claudio Atzori 3616d0f88d Merge pull request 'Adding the stats workflow to the dnet-hadoop hierarchy' (#6) from spyros/dnet-hadoop:master into master. Integrating stats update workflow. 2020-04-29 10:35:02 +02:00
Claudio Atzori 964972d29a added data provision workflow definition WIP 2020-04-29 09:25:50 +02:00
Enrico Ottonello 1edcd53581 added shell actions to download all 11 activities files from ORCID 2020-04-28 20:25:09 +02:00
miconis 62e467eb0c assertion numbers updated to fit the new implementation of the pace-core 2020-04-28 11:46:23 +02:00
Claudio Atzori 6f5b899038 reformatted code according to the updated style descriptor 2020-04-28 11:23:29 +02:00
Claudio Atzori ac25f2d8d1 integrated changes from master 2020-04-28 08:55:28 +02:00
Miriam Baglioni 2980e50edf merge upstream 2020-04-27 15:06:48 +02:00
Claudio Atzori a0bdbacdae switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin 2020-04-27 14:52:31 +02:00
Claudio Atzori 7a3f8085f7 switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin 2020-04-27 14:45:40 +02:00
Michele Artini 1260d03eba skip empty projects 2020-04-27 13:51:13 +02:00
Miriam Baglioni df34a4ebcc changed the configuration to add ignorecase option to each verb related to covid-19 community 2020-04-27 12:32:56 +02:00
Miriam Baglioni 7a59324ccf changed the test to check for the new ignorecase option 2020-04-27 12:31:46 +02:00
Miriam Baglioni 986c97348d added the ignorecase option to each selection verb 2020-04-27 12:31:05 +02:00
Miriam Baglioni a303fc9f73 resources for testing propagation of result to community from organization and from semrel 2020-04-27 11:14:16 +02:00
Miriam Baglioni c093d764a3 - 2020-04-27 11:12:38 +02:00
Miriam Baglioni c925e2be16 test for propagation of result to community from organization and result to community from semrel 2020-04-27 10:59:53 +02:00
Miriam Baglioni ec7f166690 changed the bl because of changed of the examples for the re implementation of the propagation step 2020-04-27 10:58:41 +02:00
Miriam Baglioni 6135096ef1 refactoring 2020-04-27 10:57:50 +02:00
Miriam Baglioni d30e710165 fixed duplicates action name in the workflow 2020-04-27 10:52:30 +02:00
Miriam Baglioni f9ee343fc0 new parametrized workflow with preparation steps and new parameter input files 2020-04-27 10:48:31 +02:00
Miriam Baglioni e2093644dc changed in the workflow the directory where to store the preparedInfo and the graph generated at this step 2020-04-27 10:46:44 +02:00
Miriam Baglioni 8a58bf2744 removed the writeUpdate option 2020-04-27 10:45:06 +02:00
Miriam Baglioni 5dccbe13db merge with upstream 2020-04-27 10:43:59 +02:00
Miriam Baglioni 7b6505ec69 new resources for testing propagation of project to result after the re-implementation 2020-04-27 10:42:16 +02:00
Miriam Baglioni 1b0e0bd1b5 refactoring 2020-04-27 10:40:26 +02:00
Miriam Baglioni e5a177f0a7 refactoring 2020-04-27 10:36:21 +02:00
Miriam Baglioni e000754c92 refactoring 2020-04-27 10:34:03 +02:00
Miriam Baglioni 95a54d5460 removed the writeUpdate option. The update is available in the preparedInfo path 2020-04-27 10:30:32 +02:00
Miriam Baglioni 8802e4126b re-implemented inverting the couple: from (projectId, relatedResultList) to (resultId, relatedProjectList) 2020-04-27 10:26:55 +02:00
Enrico Ottonello a1861b9eaa workflow works in parallel on 2 activity files 2020-04-24 18:33:37 +02:00
Enrico Ottonello 941e94af06 added workflow for generating authors with dois data sequence file 2020-04-24 15:50:40 +02:00
Claudio Atzori 268462623a refined definition of equals and hash methods for Oaf model classes, now based on entity identifier, while relations consider sourceid, targetid and relationship semantic; Factored out function to group Oaf objects in grouping operations; Raw graph creation procedure merges entities and relationships providing the same identity 2020-04-24 14:42:01 +02:00
Claudio Atzori a3e480d1c9 implemented DispatchEntitiesApplication using spark2 datasets 2020-04-24 14:36:53 +02:00
Claudio Atzori 48157e0fc4 GraphHiveImporterJob moved in dedicated package 2020-04-24 14:32:28 +02:00
Miriam Baglioni adcbf0e29a refactoring 2020-04-24 10:47:43 +02:00
Claudio Atzori 278fc9d276 code formatting 2020-04-23 18:51:38 +02:00
miconis 5414236644 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-04-23 18:17:23 +02:00
miconis 8d258c85ff spark dedup test fixed, sample for dataset and orp added, test implemented 2020-04-23 18:16:20 +02:00
Michele Artini 072eae3803 fixed a problem with missing contexts 2020-04-23 16:42:49 +02:00
Michele Artini b164d96874 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-04-23 16:19:16 +02:00
Michele Artini d920ce501e fixed a problem with missing instances 2020-04-23 16:18:40 +02:00
Miriam Baglioni 0e447add66 removed unuseful classes 2020-04-23 12:59:43 +02:00
Miriam Baglioni edb00db86a refactoring 2020-04-23 12:57:35 +02:00
Miriam Baglioni 44fab140de - 2020-04-23 12:42:07 +02:00
Miriam Baglioni 769aa8178a refactoring 2020-04-23 12:40:44 +02:00
Miriam Baglioni d8dc31d4af refactoring 2020-04-23 12:35:49 +02:00
Miriam Baglioni 8c5dac5cc3 removed unuseful classes 2020-04-23 12:30:58 +02:00
Miriam Baglioni 15656684b9 added properties for the preparation step and actual propagation. Added the new parametrized workflow 2020-04-23 12:13:34 +02:00
Miriam Baglioni 6f35f5ca42 added the steps of reset output dir and copy information not changed by the propagation step 2020-04-23 12:12:07 +02:00
Miriam Baglioni 19cd5b85c0 changed the classname to execute 2020-04-23 12:07:41 +02:00
Miriam Baglioni fa2ff5c6f5 refactoring 2020-04-23 11:58:26 +02:00
Miriam Baglioni 540f70298b added missing property 2020-04-23 11:51:48 +02:00
Miriam Baglioni e431fe4f5b added the implements Serializable to each class 2020-04-23 11:48:47 +02:00
Miriam Baglioni 24fa81d7e8 implementation parametrized for result type 2020-04-23 11:44:19 +02:00
Miriam Baglioni ab2a24cc2b changed the dependency to use reflections to find annotated classes 2020-04-23 11:08:47 +02:00
Miriam Baglioni 5153d88bd3 definition of workflow and properties for bulktagging 2020-04-23 11:04:53 +02:00
Miriam Baglioni 3b2e4ab670 test for bulktag 2020-04-23 10:00:10 +02:00
Sandro La Bruzzo fdc0523e4c Merge remote-tracking branch 'origin/master' into doiboost 2020-04-23 09:34:13 +02:00
Sandro La Bruzzo 4ba386d996 improved crossref mapping 2020-04-23 09:33:48 +02:00
Claudio Atzori 8851050814 replaced hive_db_name with hiveDbName 2020-04-23 08:36:40 +02:00
Claudio Atzori 91f81107b1 applying code formatting 2020-04-23 07:52:32 +02:00
Claudio Atzori 1e7583c5a6 filtered invisible records in data provision workflow 2020-04-23 07:51:34 +02:00
Claudio Atzori 9ddafd46ca fixed dedup record id prefix, set the correct dataInfo in the DedupRecordFactory 2020-04-23 07:50:18 +02:00
Claudio Atzori ade4cb97af fixed parameters passed to the postprocessing action in the workflow mapping the graph as hive DB 2020-04-22 18:24:06 +02:00
Sandro La Bruzzo bb6c9785b4 Merge remote-tracking branch 'origin/master' into doiboost 2020-04-22 15:00:57 +02:00
Sandro La Bruzzo 157915988c improved crossref mapping 2020-04-22 15:00:44 +02:00
Enrico Ottonello 5977f08e92 merged 2020-04-22 14:50:50 +02:00
Enrico Ottonello 7d759947ae used vtd for parsing orcid xml record, set 4g heapspace 2020-04-22 14:41:19 +02:00
Claudio Atzori e81960335c Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-04-22 10:46:37 +02:00
Michele Artini 9e4d58f505 ResultType 2020-04-22 10:07:26 +02:00
Claudio Atzori c891661822 small adjustments in the graph2hive workflow 2020-04-21 18:52:23 +02:00
Miriam Baglioni 259525cb93 Merge remote-tracking branch 'upstream/master' 2020-04-21 18:33:46 +02:00
Miriam Baglioni 30e53261d0 minor 2020-04-21 18:00:53 +02:00
Claudio Atzori 0b55795d4d small adjustments in the provisioning workflow 2020-04-21 16:15:04 +02:00
Claudio Atzori 88fbb3a353 added sparkSqlWarehouseDir to the default extra spark options passed to each workflow 2020-04-21 16:13:43 +02:00
Claudio Atzori cd320efa96 added extra spark options to graph to hive workflow 2020-04-21 16:12:20 +02:00
Miriam Baglioni 90c768dde6 added shaded libs module 2020-04-21 16:03:51 +02:00
Claudio Atzori 91e72a6944 Dataset based implementation for SparkCreateDedupRecord phase, fixed datasource entity dump supplementing dedup unit tests 2020-04-21 12:06:08 +02:00
miconis 5c9ef08a8e spark dedup test fixed 2020-04-21 10:19:04 +02:00
Sandro La Bruzzo 3624947a7f Merge remote-tracking branch 'origin/master' into doiboost 2020-04-21 08:34:24 +02:00
Claudio Atzori d772d967aa restored changes from master branch 2020-04-20 18:53:06 +02:00
Claudio Atzori eb8a020859 fixed behaviour of DedupRecordFactory 2020-04-20 18:44:06 +02:00
Sandro La Bruzzo 039f9b7871 Merge remote-tracking branch 'origin/master' into doiboost 2020-04-20 18:10:29 +02:00
Sandro La Bruzzo e4b105cece improved crossref mapping 2020-04-20 18:10:07 +02:00
Claudio Atzori ede1af3d85 Merge branch 'master' into deduptesting 2020-04-20 16:52:14 +02:00
miconis 1102e32462 SparkDedupTest updated and organization dump fixed 2020-04-20 16:49:01 +02:00
Claudio Atzori 667d23c58b finalising Actionset migration workflow 2020-04-20 16:45:21 +02:00
miconis 4da13e4570 Revert "Merge branch 'master' into deduptesting". This reverts commit 772f75d167, reversing changes made to 5f45f2c77f. 2020-04-20 16:04:49 +02:00
Claudio Atzori 9147af7fed actionsets migration workflow moved in dhp-workflows/dhp-actionmanager 2020-04-20 15:24:33 +02:00
miconis 772f75d167 Merge branch 'master' into deduptesting 2020-04-20 14:50:12 +02:00
Sandro La Bruzzo 5d46ec7d5f fixed name of wrong package 2020-04-20 14:49:32 +02:00
Sandro La Bruzzo 82cc3b707d fixed name of wrong package 2020-04-20 14:47:06 +02:00
Sandro La Bruzzo b2c872cb4d merged master 2020-04-20 14:04:40 +02:00
Sandro La Bruzzo 7029942e06 Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost 2020-04-20 13:26:41 +02:00
Sandro La Bruzzo 0e45f4d450 continue mapping from crossref to OAF 2020-04-20 13:26:29 +02:00
Enrico Ottonello a466648b4b renamed output file 2020-04-20 12:32:03 +02:00
Claudio Atzori d714bfb4d4 collectedfrom field moved in common parent class Oaf.java 2020-04-20 12:25:19 +02:00
Enrico Ottonello 4ae55e3891 added workflow parameters 2020-04-20 12:00:04 +02:00
Michele Artini 8ff7facfa3 fixed collectedFrom ID 2020-04-20 11:09:27 +02:00
Sandro La Bruzzo eef60bb9f4 created structure of oozie wf for ORCID 2020-04-20 10:24:57 +02:00
Sandro La Bruzzo 4d0d9de07e reorganized package and fixed test 2020-04-20 10:02:42 +02:00
Sandro La Bruzzo 618bc1fc72 first implementation of crossrefMapping 2020-04-20 09:53:34 +02:00
Michele Artini 25307965d2 add a default datainfo if missing 2020-04-20 09:43:27 +02:00
Michele Artini d2058fdc47 tests 2020-04-20 09:31:14 +02:00
Enrico Ottonello 1d44a359ea renamed package folder 2020-04-20 09:25:40 +02:00
Michele Artini 478a958f09 tests 2020-04-20 09:15:27 +02:00
Miriam Baglioni e1848b7603 minor 2020-04-18 14:16:42 +02:00
Miriam Baglioni 0ff9b1ef05 added needed parameter 2020-04-18 14:16:29 +02:00
Miriam Baglioni e2dfe8b656 removed not used action 2020-04-18 14:16:07 +02:00
Miriam Baglioni 437ebbad76 refactoring 2020-04-18 14:15:09 +02:00
Miriam Baglioni 9a8876ac86 added needed parameter 2020-04-18 14:14:08 +02:00
Miriam Baglioni 9854852878 refactoring 2020-04-18 14:13:16 +02:00
Miriam Baglioni 454b8a6a29 Merge remote-tracking branch 'upstream/master' 2020-04-18 14:09:44 +02:00
Miriam Baglioni 890ec28f0f input parameters for preparation step1 2020-04-18 14:09:37 +02:00
Miriam Baglioni fbf5c27c27 Added preparation classes before actual propagation 2020-04-18 14:09:03 +02:00
Claudio Atzori 5f45f2c77f Merge branch 'master' into deduptesting 2020-04-18 12:46:40 +02:00
Claudio Atzori ad7a131b18 introduced common project code formatting plugin, works on the commit hook, based on https://github.com/Cosium/git-code-format-maven-plugin, applied to each java class in the project 2020-04-18 12:42:58 +02:00
Claudio Atzori a2938dd059 cleanup 2020-04-18 12:24:22 +02:00
Claudio Atzori 9374ff03ea Merge branch 'master' into deduptesting 2020-04-18 12:06:58 +02:00
Claudio Atzori 71813795f6 various refactorings on the dnet-dedup-openaire workflow 2020-04-18 12:06:23 +02:00
Enrico Ottonello 7011d4203e parser of orcid summaries from tar gz file on hdfs, that creates a sequence file with authors informations (oid, name, surname, credit name) 2020-04-17 18:52:39 +02:00
miconis 6450bb0daa test for softwares dedup added. definition of orp, dataset and sw dedup configurations 2020-04-17 17:31:59 +02:00
Miriam Baglioni 72c63a326e removed unuseful class 2020-04-17 17:14:51 +02:00
Miriam Baglioni 00c2ca3ee5 - 2020-04-17 17:14:25 +02:00
Miriam Baglioni 5cd092114f use mergeFrom method to add the new community contexts 2020-04-17 17:13:18 +02:00
Miriam Baglioni 264c82f21e minor 2020-04-17 16:54:46 +02:00
Miriam Baglioni 8c079c7a49 unit test for orcid to result propagation from semrel 2020-04-17 16:53:03 +02:00
Miriam Baglioni eacd140a98 added missing parameter(s) 2020-04-17 16:52:30 +02:00
Miriam Baglioni 390e250faf use the addPid method of the Author class to add a new pid 2020-04-17 16:52:02 +02:00
Miriam Baglioni b46b080ddc use mergeFrom method call to add the country(ies) instead of modify the result directly. 2020-04-17 16:50:54 +02:00
Miriam Baglioni c4987dd12a minor 2020-04-17 16:49:08 +02:00
Claudio Atzori 038ac7afd7 relation consistency workflow separated from dedup scan and creation of CCs 2020-04-17 13:12:44 +02:00
Claudio Atzori c92bfeeaee Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-04-17 13:07:52 +02:00
Miriam Baglioni adc11c97a7 Merge remote-tracking branch 'upstream/master' 2020-04-17 12:34:31 +02:00
Sandro La Bruzzo a329ea5575 merged with master branch 2020-04-17 12:23:54 +02:00
Sandro La Bruzzo 01ea7721f3 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-04-17 12:12:25 +02:00
Sandro La Bruzzo 5e2fa996aa fixed problem with conversion of long into string 2020-04-17 12:11:51 +02:00
miconis 418cf94642 implementation of the deletedbyinference test in propagating relations 2020-04-17 10:40:21 +02:00
Miriam Baglioni 5d772e5263 new implementation of propagation of community to result from organization that exploits the prepared info 2020-04-16 18:45:22 +02:00
Miriam Baglioni fff1e5ec39 classes to (de)serialize the data provided in the preparation step 2020-04-16 18:44:43 +02:00
Miriam Baglioni 3fd9d6b02f preparation phase for the propagation of community to result from organization 2020-04-16 18:43:55 +02:00
Miriam Baglioni a9120164aa added hive parameter and a step of reset of the working dir in the workflow 2020-04-16 18:42:04 +02:00
Miriam Baglioni 6afbd542ca changed the save mode to avoid NegativeArraySize... error. Needed to modify also the preparationstep2 2020-04-16 18:40:14 +02:00
Miriam Baglioni d60fd36046 changed the save method 2020-04-16 16:14:15 +02:00
Miriam Baglioni 951b13ac46 input parameters and workflow for new implementation of propagation of orcid to result from semrel and preparation phases 2020-04-16 16:13:10 +02:00
Miriam Baglioni 4d89f3dfed removed unuseful classes 2020-04-16 16:11:44 +02:00
Miriam Baglioni 5e72a51f11 - 2020-04-16 16:11:20 +02:00
Miriam Baglioni c33a593381 renamed 2020-04-16 16:09:47 +02:00
Miriam Baglioni 0e5399bf74 second phase of data preparation. Groups all the possible updates by id 2020-04-16 16:08:51 +02:00
Miriam Baglioni 548ba915ac first phase of data preparation. For each result type (parallel) it produces the possible updates 2020-04-16 15:58:42 +02:00
Miriam Baglioni 243013cea3 to (de)serialize the association from the resultId and the list of authoritative authors with orcid to possibly propagate 2020-04-16 15:57:29 +02:00
Miriam Baglioni ac3ad25b36 to (de)serialize needed information of the author to determine if the orcid can be passed (name, surname, fullname (?), orcid) 2020-04-16 15:56:33 +02:00
Miriam Baglioni d6cd700a32 new implementation that exploits prepared information (the list of possible updates: resultId - possible list of orcid to be added 2020-04-16 15:55:25 +02:00
Miriam Baglioni f077f22f73 minor 2020-04-16 15:54:16 +02:00
Miriam Baglioni fd5d792e35 refactoring 2020-04-16 15:53:34 +02:00
Claudio Atzori cb0952428e Merge branch 'master' into deduptesting 2020-04-16 14:42:25 +02:00
Claudio Atzori cc21bbfb1a Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting 2020-04-16 14:41:37 +02:00
Claudio Atzori ec5dfc068d added spark.sql.shuffle.partitions=3840 to dedup scan wf 2020-04-16 14:41:28 +02:00
Claudio Atzori 09f356b047 Merge pull request 'Closes #7: subdirs inside graph table dirs' (#8) from przemyslaw.jacewicz/dnet-hadoop:przemyslawjacewicz_7_distcp_configuration_fix into master. Run the code from this PR in isolation and it worked fine. Thanks! 2020-04-16 14:38:46 +02:00
Claudio Atzori 3437383112 Merge branch 'master' into deduptesting 2020-04-16 12:46:14 +02:00
miconis 0eccbc318b Deduper class (utilities for dedup) cleaned. Useless methods removed 2020-04-16 12:36:37 +02:00
Claudio Atzori 76d23895e6 Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting 2020-04-16 12:18:32 +02:00
miconis 6a089ec287 minor changes 2020-04-16 12:15:38 +02:00
Claudio Atzori 376efd67de removed prepare statement in spark action 2020-04-16 12:14:16 +02:00
miconis 9b36458b6a Merge branch 'deduptesting' of code-repo.d4science.org:D-Net/dnet-hadoop into deduptesting 2020-04-16 12:13:58 +02:00
miconis cd4d9a148f creating temporary directories in dedup test 2020-04-16 12:13:26 +02:00
Claudio Atzori b39ff36c16 improving the wf definitions 2020-04-16 12:11:37 +02:00
Claudio Atzori 011b342bc9 trying to avoid OOM in SparkPropagateRelation 2020-04-16 11:13:51 +02:00
Miriam Baglioni 08227cfcbd resources needed for running the test on propagation of result to organization from institutional repositories 2020-04-16 11:06:10 +02:00
Miriam Baglioni a97e915c24 test unit for propagation of result to organization from institutional repository 2020-04-16 11:05:21 +02:00
Miriam Baglioni b078710924 modification to the test due to the removal of unused parameters 2020-04-16 11:04:39 +02:00
Miriam Baglioni a5e5c81a2c input parameters and workflow definition for propagation of result to organization from institutional repositories 2020-04-16 11:03:41 +02:00
Miriam Baglioni 5e1bd67680 removed unuseful parameter 2020-04-16 11:02:01 +02:00
Miriam Baglioni eaf19ce01b removed unuseful class 2020-04-16 10:59:33 +02:00
Miriam Baglioni 7bd49abbef commit to delete 2020-04-16 10:59:09 +02:00
Miriam Baglioni 53f418098b added the isTest checkpoint 2020-04-16 10:53:48 +02:00
Miriam Baglioni c28333d43f minor 2020-04-16 10:52:50 +02:00
Miriam Baglioni a8100baed6 changed the way to save the results to avoid NegativeArray... error 2020-04-16 10:50:09 +02:00
Miriam Baglioni 79b978ec57 refactoring 2020-04-16 10:48:41 +02:00
Claudio Atzori 069ef5eaed trying to avoid OOM in SparkPropagateRelation 2020-04-15 21:23:21 +02:00
Claudio Atzori 8eedfefc98 try to introduce intermediate serialization on hdfs to avoid OOM 2020-04-15 18:35:35 +02:00
Przemysław Jacewicz da019495d7 [dhp-actionmanager] target dir removal added for distcp actions 2020-04-15 17:56:57 +02:00
miconis 5689d49689 minor changes 2020-04-15 16:34:06 +02:00
Claudio Atzori c439d0c6bb PromoteActionPayloadForGraphTableJob reads directly the content pointed by the input path, adjusted promote action tests (ISLookup mock) 2020-04-15 16:18:33 +02:00
Claudio Atzori ff30f99c65 using newline delimited json files for the raw graph materialization. Introduced contentPath parameter 2020-04-15 16:16:20 +02:00
Sandro La Bruzzo 3d3ac76dda Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-04-15 15:24:01 +02:00
Sandro La Bruzzo 74a7fac774 fixed problem with timestamp 2020-04-15 15:23:54 +02:00
Miriam Baglioni 3577219127 removed unuseful classes 2020-04-15 12:45:49 +02:00
Miriam Baglioni 964b22d418 modified the writing of the new relations. Before: read old rels, add the new ones to them, write all the relations in new location. Now: first step of the wf copies the old relations in new location. If new relations are found, they are saved in the new location in append mode. 2020-04-15 12:32:01 +02:00
Miriam Baglioni 43f0590d4b change in the testing because the business logic is changed. 2020-04-15 12:29:50 +02:00
Miriam Baglioni 473d17767c new business logic for the actual propagation. It exploits previously computed information 2020-04-15 12:25:44 +02:00
Miriam Baglioni 6a377a7582 class to compute some information needed for the actual propagation 2020-04-15 12:25:11 +02:00
Miriam Baglioni 5a3487280d classes to serialize/deserialize the prepared data 2020-04-15 12:24:36 +02:00
Miriam Baglioni 62b09be43c added correct description for parameter isSparkSessionManaged 2020-04-15 12:23:06 +02:00
Miriam Baglioni 1859ce8902 minor refactoring 2020-04-15 12:21:31 +02:00
Miriam Baglioni 27f1d3ee8f minor refactoring 2020-04-15 12:21:05 +02:00
Alessia Bardi 550a9f82ed Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-04-14 17:53:01 +02:00
Alessia Bardi a68fae9bcb now supporting openaire 4.0 compliance 2020-04-14 17:52:48 +02:00
Sandro La Bruzzo c36239e693 fixed incremental indexing 2020-04-14 17:47:36 +02:00
Miriam Baglioni 3f4b579e7f new workflow. It is composed of four steps. The first removes the directory where to store the results. The second copies the relation to the new location, the third is the preparation phase and then the actual propagation 2020-04-14 16:49:24 +02:00
Miriam Baglioni ca2b40952e minor changes 2020-04-14 16:48:02 +02:00
Miriam Baglioni 61d39e659e parameters for the project2result propagation phase 2020-04-14 16:47:39 +02:00
Miriam Baglioni 92f19fa0a0 parameters for the project2result preparation phase 2020-04-14 16:46:57 +02:00
Miriam Baglioni cadab9b81d new implementation for result to project propagation. Use the prepared info in propagation 2020-04-14 16:46:07 +02:00
Miriam Baglioni ceb1f299bf minor changes 2020-04-14 16:45:12 +02:00
Claudio Atzori 82e8341f50 reorganizing parameter names in the provision workflow 2020-04-14 15:54:41 +02:00
Miriam Baglioni e0038bde5b Support class to serialize/deserialize the association project, set of linked results 2020-04-14 15:32:12 +02:00
Miriam Baglioni c0bebb7c35 code to compute the prepared information used in the actual propagation step. This step will produce two files: one with potential updates (association between projects and a list of results), the other already linked entities (association between projects and the list of results already linked to them) 2020-04-14 15:31:26 +02:00
Miriam Baglioni f47ee5b78e directory where to store the prepared info before actual propagation will take place 2020-04-14 15:29:21 +02:00
Miriam Baglioni 36cc9516d8 the starting relation set for testing 2020-04-14 15:28:34 +02:00
Miriam Baglioni 4b01dc60e6 test unit for result to project propagation 2020-04-14 15:28:00 +02:00
Miriam Baglioni 8f12292daa changed the way to save the results on filesystem 2020-04-11 16:47:34 +02:00
Miriam Baglioni 87f802821e new workflow for country propagation: it is composed of the preparation step and in the propagation. The propagation part runs in parallel on the result types 2020-04-11 16:40:22 +02:00
Miriam Baglioni a562080b0b parameters to be used in the prepared Job and in the actual country propagation job 2020-04-11 16:39:17 +02:00
Miriam Baglioni 1251ad4455 removed unuseful class 2020-04-11 16:38:13 +02:00
Miriam Baglioni aef9b3aa90 new parametric implementation of country propagation. Exploits information computed before and broadcasts it to each executor 2020-04-11 16:36:59 +02:00
Miriam Baglioni a2d833d5dd step of data preparation before actual country propagation will take place 2020-04-11 16:36:03 +02:00
Miriam Baglioni 6897c920a2 classes in support of new implementation of country propagation 2020-04-11 16:35:26 +02:00
Miriam Baglioni 85766a02d8 added dependency to use hive on local machine 2020-04-11 16:34:22 +02:00
Miriam Baglioni 79b8ea4fed prepared information to be used in actual country propagation. Subset of info 2020-04-11 16:29:41 +02:00
Miriam Baglioni 1822476613 Test for country propagation 2020-04-11 16:28:09 +02:00
Miriam Baglioni 7783b09c5b new implementation for result to project propagation. Prepare some info to be used in propagation 2020-04-11 16:26:23 +02:00
Claudio Atzori 6b5f9ca9cb raw graph creation workflow moved under dhp-graph-mapper, claims integration is included 2020-04-10 17:53:07 +02:00
Miriam Baglioni 90469789b9 two new classes for new implementation of project to result propagation 2020-04-09 13:29:01 +02:00
Miriam Baglioni 627ad58a8b new wf definition 2020-04-09 11:33:19 +02:00
Miriam Baglioni 9c63c4840d new workflow and parameters for country propagation 2020-04-08 19:13:42 +02:00
Miriam Baglioni a2d309545b new parametrized implementation for country propagation 2020-04-08 19:12:59 +02:00
Miriam Baglioni 6dfdba9ef7 new parametrized implementation for country propagation 2020-04-08 18:14:37 +02:00
Miriam Baglioni 03f7cb6402 new parametrized implementation for country propagation 2020-04-08 18:08:41 +02:00
Miriam Baglioni df2fc4a6d7 Merge remote-tracking branch 'upstream/master' 2020-04-08 18:07:26 +02:00
Miriam Baglioni fcfef4632f input parameters for country propagation preparation job 2020-04-08 18:07:18 +02:00
miconis 0be2e72be5 further implementation of tests for the deduplication of each entity. publication dump added, empty entity files created 2020-04-08 18:02:30 +02:00
Miriam Baglioni 61045e84d9 merged conflict in pom 2020-04-08 14:23:30 +02:00
Claudio Atzori 47f3d9b757 unit test for GraphHiveImporterJob 2020-04-08 13:24:43 +02:00
Sandro La Bruzzo ba9f07a6fe fixed wrong test 2020-04-08 13:18:20 +02:00
Miriam Baglioni 540da4ab61 new business logic with prepared info before actual job run 2020-04-08 13:04:04 +02:00
Miriam Baglioni 8438702b3d addition in propagation constants 2020-04-08 10:54:01 +02:00
Miriam Baglioni 2afe971816 new implementation for country propagation 2020-04-08 10:49:09 +02:00
Miriam Baglioni beebbcf66b new config for countrypropagation 2020-04-08 10:31:29 +02:00
Claudio Atzori d74e128aa6 Utility classes moved in dhp-common and dhp-schemas 2020-04-07 11:56:22 +02:00
Claudio Atzori c57cf679ca Merge branch 'provision_dataset' 2020-04-07 08:56:58 +02:00
Claudio Atzori 1a1a026a18 we do expect to find field bestaccessright already defined. No need to add it again 2020-04-07 08:55:33 +02:00
Claudio Atzori fbdd18a96b using dataset based relation preparation procedure 2020-04-07 08:54:39 +02:00
Claudio Atzori 77f59b1b10 dataset based provision WIP 2020-04-06 19:37:27 +02:00
Claudio Atzori 6177cf36fb Merge pull request 'Closes #4: New action manager implementation' (#5) from przemyslaw.jacewicz/dnet-hadoop:przemyslawjacewicz_actionmanager_impl_prototype into master. Nothing more to add here. Thanks for your contribution! 2020-04-06 17:35:07 +02:00
Claudio Atzori e355961997 dataset based provision WIP 2020-04-06 17:34:25 +02:00
miconis 56fbe689f0 implementation of the tests for each spark action 2020-04-06 16:30:31 +02:00
Claudio Atzori ca345aaad3 dataset based provision WIP 2020-04-06 15:33:31 +02:00
Claudio Atzori c8f4b95464 dataset based provision WIP 2020-04-06 08:59:58 +02:00
Claudio Atzori eb2f5f3198 dataset based provision WIP 2020-04-04 17:41:31 +02:00
Claudio Atzori 3d1b637cab dataset based provision WIP 2020-04-04 14:03:43 +02:00
miconis 53fd624c34 implemented test for sparkcreatesimrels 2020-04-03 18:32:25 +02:00
Claudio Atzori 24b2c9012e dataset based provision WIP 2020-04-02 18:44:09 +02:00
miconis a61763d149 structure for sparksimrel changed to be compliant with mockito testing 2020-04-02 18:37:53 +02:00
Claudio Atzori daa26acc9d dataset based provision WIP, fixed spark2EventLogDir 2020-04-02 16:15:50 +02:00
Przemysław Jacewicz 7b2a7e2417 [dhp-actionmanager] missing descriptions added and minor naming and formatting fixes 2020-04-02 11:48:40 +02:00
Spyros Zoupanos 1ab97bbe00 Adding the full stats workflow to the dnet-hadoop hierarchy 2020-04-01 22:22:05 +03:00
Claudio Atzori 9c7092416a dataset based provision WIP 2020-04-01 19:07:30 +02:00
miconis bfa5bc74df minor changes 2020-04-01 19:05:48 +02:00
Przemysław Jacewicz 80cf43b9c8 [dhp-actionmanager] promoting workflow added 2020-04-01 18:51:25 +02:00
Przemysław Jacewicz 5b459bcc47 [dhp-actionmanager] promoting spark job added 2020-04-01 18:49:08 +02:00
miconis 9802bcb9fe dedup testing 2020-04-01 18:48:31 +02:00
Przemysław Jacewicz e21bb89dbd [dhp-actionmanager] partitioning spark job added 2020-04-01 18:41:29 +02:00
Przemysław Jacewicz f9f7350bb9 [dhp-actionmanager] common package added with utility classes supporting hadoop and spark envs 2020-04-01 18:39:26 +02:00
Przemysław Jacewicz ad70c23b2e [dhp-actionmanager] pom updated 2020-04-01 18:36:00 +02:00
Przemysław Jacewicz 4e910a78d4 [dhp-workflows] spark 2 connection properties added 2020-04-01 18:29:26 +02:00
Claudio Atzori 1402eb1fe7 cleanup 2020-04-01 15:38:50 +02:00
Claudio Atzori 7061d07727 ActionSets migration serialize the output as plain text files instead of SequenceFiles 2020-04-01 14:58:22 +02:00
Claudio Atzori adcdd2d05e WIP: reimplementing the adjacency list construction process using spark Datasets 2020-04-01 14:56:57 +02:00
Sandro La Bruzzo 205e9521c6 implemented import crossref job 2020-04-01 14:12:33 +02:00
Sandro La Bruzzo 201d79021e Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-31 14:41:41 +02:00
Sandro La Bruzzo cd7416ae4c first implementation of incremental update of scholix index 2020-03-31 14:41:35 +02:00
przemek 9d1d18d4b9 Merge branch 'master' into przemyslawjacewicz_actionmanager_impl_prototype 2020-03-31 12:04:58 +02:00
Claudio Atzori 377e1ba840 [maven-release-plugin] prepare for next development iteration 2020-03-30 20:06:00 +02:00
Claudio Atzori 76d9315129 [maven-release-plugin] prepare release dhp-1.1.6 2020-03-30 20:05:56 +02:00
Claudio Atzori ef429010ee removed log file and job-override.properties 2020-03-30 20:00:58 +02:00
Claudio Atzori 0fbec69b82 use oozie prepare statement to cleanup working directories 2020-03-30 19:48:41 +02:00
Claudio Atzori 3af2b8d700 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-30 13:12:21 +02:00
Claudio Atzori f3f9affd49 allow dynamic executors to build XML records 2020-03-30 13:12:11 +02:00
Claudio Atzori 2e2d4c4c68 adjusted path to template resource 2020-03-30 13:11:49 +02:00
Miriam Baglioni dd011f4a95 to make them visible to Claudio 2020-03-30 10:55:47 +02:00
Miriam Baglioni b1af90a45f to make it visible to Claudio 2020-03-30 10:50:03 +02:00
Sandro La Bruzzo 62cc257e5c fixed step1 workflow 2020-03-27 17:07:34 +01:00
Sandro La Bruzzo 1a7a866861 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-27 15:11:48 +01:00
Sandro La Bruzzo 7cef698f36 reformat code 2020-03-27 15:11:34 +01:00
Claudio Atzori 1767dfaa3f method can be protected, it is meant to be used only in tests 2020-03-27 14:31:26 +01:00
Sandro La Bruzzo a4b6a51168 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-27 13:48:56 +01:00
Sandro La Bruzzo 15d9106b3f Fixed merge of dhp dedup 2020-03-27 13:48:44 +01:00
Claudio Atzori e196fff212 adjusted path for source resource in unit test 2020-03-27 13:45:10 +01:00
Sandro La Bruzzo 8c9a56a0c8 refactored package name 2020-03-27 13:19:33 +01:00
Sandro La Bruzzo 2bd2d6f202 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-27 13:16:36 +01:00
Sandro La Bruzzo a9935f80d4 refactor class name and workflow name for graph mapper, added javadoc 2020-03-27 13:16:24 +01:00
Michele Artini ae03948eed Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-27 11:47:07 +01:00
Michele Artini f6e86b44a6 tests 2020-03-27 11:46:37 +01:00
Michele Artini 408be3c632 test and fixed a problem with datacite namespaces 2020-03-27 11:44:50 +01:00
Claudio Atzori 673e744649 moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa 2020-03-27 10:42:17 +01:00
Claudio Atzori 098fabab3f reorganizing content under dhp-workflows/dhp-graph-mapper 2020-03-26 19:44:19 +01:00
Claudio Atzori 77c4294924 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-26 18:26:52 +01:00
Claudio Atzori 43cbcda7ef unit test for SparkGraphImporterJob 2020-03-26 18:26:40 +01:00
Sandro La Bruzzo e04da6d66a merged all oozie wf in one 2020-03-26 14:17:07 +01:00
Sandro La Bruzzo e71e001b58 commented test that doesn't work 2020-03-26 14:15:21 +01:00
Sandro La Bruzzo 0cd022ad6a merge with master 2020-03-26 14:08:29 +01:00
Claudio Atzori abcd3f5bf5 added sample data for unit tests 2020-03-26 11:12:52 +01:00
Sandro La Bruzzo d5f11e27be renamed wf 2020-03-26 09:49:23 +01:00
Sandro La Bruzzo 9a37ad0127 renamed modules 2020-03-26 09:46:46 +01:00
Sandro La Bruzzo a768226e52 updated generate scholix to generate json 2020-03-26 09:40:50 +01:00
Claudio Atzori 9dff4adbc3 dhp-graph-mapper workflow tests upgraded to junit5 2020-03-25 18:25:12 +01:00
Claudio Atzori cd7dc3e1ae dhp-dedup-openaire workflow tests upgraded to junit5 2020-03-25 18:04:23 +01:00
Claudio Atzori c0e825e713 dhp-aggregation workflow tests upgraded to junit5 2020-03-25 17:59:45 +01:00
Michele Artini ebe45003d9 fixed some junit packages 2020-03-25 16:45:03 +01:00
Michele Artini d9bfdcd607 updated poms 2020-03-25 16:31:12 +01:00
Michele Artini 120e823cd1 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-25 16:00:10 +01:00
Claudio Atzori 71ae7dd272 renamed module dnet-dedup to dnet-dedup-openaire 2020-03-25 15:57:09 +01:00
Michele Artini fd57722c69 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-25 15:56:49 +01:00
Claudio Atzori f441f823dd fixed path referencing a test resource file 2020-03-25 15:21:46 +01:00
Claudio Atzori 51d0c9bdd7 integrated changes from branch dedupTest 2020-03-25 15:15:41 +01:00
Claudio Atzori 36f8f2ea66 master set to 'yarn' in spark actions, removed path to rawSet from the dedup scan workflow 2020-03-25 14:16:06 +01:00
Michele Artini 2559299da4 tests 2020-03-25 12:25:00 +01:00
Claudio Atzori 2180cc4fe7 more fields included in result view definition 2020-03-25 11:21:46 +01:00
Claudio Atzori efb0b7d660 master set to 'yarn' in spark actions 2020-03-25 11:15:35 +01:00
Michele Artini 0fda2c3a30 some tests on db records 2020-03-25 09:43:58 +01:00
miconis 02320de371 minor changes 2020-03-24 17:43:51 +01:00
miconis 8e8b5e8f30 roots wf merged in scan wf 2020-03-24 17:40:58 +01:00
Miriam Baglioni 19d7f8b51d uncommented execution for some of the result types for testing purposes 2020-03-24 16:49:46 +01:00
Miriam Baglioni ad24c8478f added missing parameter 2020-03-24 16:19:59 +01:00
Miriam Baglioni 46094a3eec bug fixing for implementation with dataset 2020-03-24 16:19:36 +01:00
Claudio Atzori 51ff68db66 Merge branch 'dedupTest' of https://code-repo.d4science.org/D-Net/dnet-hadoop into dedupTest 2020-03-24 11:18:19 +01:00
Claudio Atzori 1e869e7bed using method available from currently used library 2020-03-24 11:17:44 +01:00
miconis f0d72b76a8 package structure fixed 2020-03-24 10:51:40 +01:00
Claudio Atzori aaedbb1b8b WIP: dedup workflow, stage 2 2020-03-24 09:59:28 +01:00
Michele Artini e3760c7f39 fix a bug with organization countries 2020-03-24 08:43:56 +01:00
Claudio Atzori 8b0ba3d76a postprocessing script correctly run as hive2 action 2020-03-23 17:40:39 +01:00
miconis 93e2291291 minor changes 2020-03-23 17:17:56 +01:00
miconis f7890a90df implementation of the mechanism that checks the existence of a mergerel file 2020-03-23 17:13:30 +01:00
Miriam Baglioni ad712f2d79 added the needed variables in the config and read the variables in the workflow 2020-03-23 17:11:36 +01:00
Miriam Baglioni f1e9fe9752 changed implementation using dataset and query on hive 2020-03-23 17:11:00 +01:00
Miriam Baglioni f09cd1e911 removed unuseful variable in the configuration 2020-03-23 17:10:14 +01:00
Miriam Baglioni 9418e3d4fa read dataset from files instead of using hive tables 2020-03-23 17:09:27 +01:00
Miriam Baglioni a7bf037306 remove unused class 2020-03-23 14:36:43 +01:00
Miriam Baglioni 8ab8b6b0bf minor 2020-03-23 14:35:23 +01:00
Miriam Baglioni 30d58fd98c change the configuration of the workflow 2020-03-23 14:32:49 +01:00
Miriam Baglioni a440152b46 refactoring 2020-03-23 14:30:56 +01:00
Miriam Baglioni 47561f3597 changed the implementation from rdd to dataset got from sql queries (on hive) 2020-03-23 11:58:32 +01:00
miconis c20e179f5a structure of the workflows updated 2020-03-23 11:43:49 +01:00
Claudio Atzori 658d40ccbe WIP trying to use hive2 actions 2020-03-23 11:14:54 +01:00
Claudio Atzori ecb64e4998 Merge branch 'migration_wfs_regular_all_steps' 2020-03-23 08:57:01 +01:00
Michele Artini 15160032bd fixed a bug setting some organization fields 2020-03-23 08:39:14 +01:00
Claudio Atzori a4c52661a0 WIP: fixing dedup workflows 2020-03-20 19:17:24 +01:00
Claudio Atzori 6cb0a9bff0 dedup wf directory structure aligned with project commons 2020-03-20 16:48:14 +01:00
miconis e16e644faf implementation of the workflow for entity update and for relations update 2020-03-20 13:01:56 +01:00
przemek 638b78f96a Merge remote-tracking branch 'origin/master' into przemyslawjacewicz_actionmanager_impl_prototype 2020-03-19 15:12:56 +01:00
miconis 4e82a24af2 minor changes and implementation of the create connected components action 2020-03-19 15:01:07 +01:00
Claudio Atzori 36236dd1c1 action migration workflow produces eu.dnetlib.dhp.schema.action.AtomicAction(s) 2020-03-19 14:00:38 +01:00
Claudio Atzori a0ab15a64c need to stick on using guava:11.0.2 as it is the version used by the hadoop components (oozie client for sure). The last version (28.2-jre) breaks the oozie workflow submission 2020-03-19 13:58:58 +01:00
Sandro La Bruzzo 0594b92a6d implemented relation with dataset 2020-03-19 11:11:07 +01:00
miconis 679b5869e5 implementation of the lookup procedure to take dedup conf from the resource profiles 2020-03-18 17:41:56 +01:00
Claudio Atzori abe8fb69a2 added global properties, moved postprocessing script inside the oozie_app directory 2020-03-18 15:43:54 +01:00
miconis f32eae5ce9 implementation of the spark action for the simrel creation 2020-03-18 14:27:49 +01:00
Claudio Atzori c7e0730720 compress the output produced by migration steps 1 and 2 2020-03-18 09:34:57 +01:00
Claudio Atzori 2f11e37602 fixed expansion of path variables 2020-03-17 19:41:07 +01:00
Claudio Atzori 2795b0b096 no need to mkdir the all_entities file 2020-03-17 17:22:14 +01:00
Claudio Atzori 19746ad308 when reuseContent, reset ${workingPath}/all_entities 2020-03-17 17:17:06 +01:00
Claudio Atzori 2f0c85eeb3 updated parameters for regular_all_steps workflow, introduced flag 'reuseContent' 2020-03-17 17:04:58 +01:00
Miriam Baglioni 67ea3cf3ed changed the way to read the file with info on resource or relation. From sequenceFile to textFile 2020-03-17 16:32:05 +01:00
Miriam Baglioni b4652d018c moved the creation of new dir to common class. 2020-03-17 16:31:24 +01:00
Claudio Atzori b8290b5851 updated parameters for regular_all_steps workflow 2020-03-17 15:45:30 +01:00
Claudio Atzori 4706f24ec5 updated parameters for regular_all_steps workflow 2020-03-17 15:23:54 +01:00
Claudio Atzori aeb01fa353 reading from newline delimited json textfiles instead of sequence files 2020-03-17 11:57:24 +01:00
Miriam Baglioni 92f4e0001d Merge branch 'bulktag' 2020-03-16 13:33:27 +01:00
Miriam Baglioni ab08a37024 Merge remote-tracking branch 'upstream/master' 2020-03-16 12:45:23 +01:00
Claudio Atzori af835f2f98 when migrating actionsets from DM cluster, populate the AtomicAction.targetValue when empty (dedup similarities) 2020-03-15 18:07:59 +01:00
Claudio Atzori 9c84e21b87 added workflow to migrate latest version of each actionset content from DM to OCEAN cluster, mapping the targetValues from the old protobuf data model to the dhp.OAF datamodel 2020-03-13 15:56:52 +01:00
Claudio Atzori 8fe7ae1482 xml formatting 2020-03-13 15:53:56 +01:00
Przemysław Jacewicz d0c9b0cdd6 WIP promote job functions updated 2020-03-13 12:36:42 +01:00
Przemysław Jacewicz 8d9b3c5de2 WIP action payload mapping into OAF type moved, (local) graph table name enum created, tests fixed 2020-03-13 10:01:39 +01:00
Przemysław Jacewicz 5cc560c7e5 Removed unnecessary dependency on old OAF model 2020-03-13 09:57:46 +01:00
Sandro La Bruzzo addaaa091f migrate relation from RDD to Dataset 2020-03-13 09:13:20 +01:00
Przemysław Jacewicz 3f24593e51 WIP: promote job tests and test resources implementation snapshot 2020-03-11 17:06:29 +01:00
Przemysław Jacewicz 2e996d610f WIP: promote job functions implementation snapshot 2020-03-11 17:02:57 +01:00
Przemysław Jacewicz cc63cdc9e6 WIP: promote job implementation snapshot 2020-03-11 17:02:06 +01:00
Przemysław Jacewicz 69540f6f78 Serialization-safe supplier added 2020-03-11 16:59:05 +01:00
Przemysław Jacewicz e6e214dab5 Oaf merge and get strategy added 2020-03-11 16:58:17 +01:00
Claudio Atzori 7b6f0c8756 reading graph dump as text files, encoded as newline-delimited JSON records, as indicated in the wiki 2020-03-10 17:19:17 +01:00
Claudio Atzori 60aedb1110 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-10 17:09:44 +01:00
Claudio Atzori a3f184fd3f added field websiteurl in related organizations 2020-03-10 17:08:58 +01:00
Claudio Atzori 0e95544495 fixed serialization for datasource subjects 2020-03-10 17:07:44 +01:00
Sandro La Bruzzo 7b28783fb4 updated unpaywall mapping 2020-03-08 17:00:19 +01:00
Michele Artini b6efa9d6ab Configuration of the SequenceFile Writer 2020-03-05 15:49:14 +01:00
Claudio Atzori 5e342a555c no need to compute the inverse relClass, fixed text() in xpath expressions 2020-03-05 12:51:48 +01:00
Claudio Atzori 6ec04d4e02 specified column used to perform the join operation in the javadoc 2020-03-05 12:50:38 +01:00
Michele Artini 7a2a466161 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-04 14:50:59 +01:00
Michele Artini 755eade2fb fix creation ids 2020-03-04 14:49:45 +01:00
Claudio Atzori 6379f32466 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-04 10:57:06 +01:00
Claudio Atzori 0233987603 introduced post processing step following the hive DB creation/population 2020-03-04 10:56:50 +01:00
Claudio Atzori 1e563bc15e introduced distinct properties driving the resource usage for the XML record creation and the indexing phase 2020-03-04 10:55:11 +01:00
Claudio Atzori 9af3e904be close the SparkSession at the end 2020-03-04 10:53:31 +01:00
Michele Artini e7167b996a logs and closeable 2020-03-04 10:46:36 +01:00
Claudio Atzori 25ceec29ab code formatting 2020-03-04 10:44:24 +01:00
Claudio Atzori 63c00c5e88 fixed typo 2020-03-04 10:43:44 +01:00
Miriam Baglioni c37f2bd1b5 moved some classes to package to make code clearer 2020-03-03 16:42:23 +01:00
Miriam Baglioni d9d2060561 implementation for bulk tagging 2020-03-03 16:38:50 +01:00
Miriam Baglioni e80f80ca93 properties and workflow for new propagation 2020-03-02 17:03:31 +01:00
Claudio Atzori 9cf5ce2e66 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-02 17:03:10 +01:00