Commit Graph

332 Commits

Author SHA1 Message Date
Claudio Atzori e725c88ebb [raw_all] patching relation identifier phase to be run at the end, i.e. includes also claimed relations 2021-07-29 13:03:43 +02:00
Claudio Atzori 5d08ad86ae [raw_all] patching relation identifier phase to be run at the end, i.e. includes also claimed relations 2021-07-29 13:03:16 +02:00
Claudio Atzori e87e1805c4 [raw_all] added extra workflow step for patching the identifiers in the relations, given an id mapping dataset 2021-07-29 12:13:06 +02:00
Michele Artini c72c960ffb added eosc fields 2021-07-28 11:03:15 +02:00
Michele Artini 1fb572a33a added eosc fields 2021-07-28 10:52:24 +02:00
Claudio Atzori d267dce520 [raw_all] added extra workflow step for patching the identifiers in the relations, given an id mapping dataset 2021-07-27 17:18:29 +02:00
Sandro La Bruzzo 8fac10c91e fixed defintion wf of creation final infospace of scholexplorer 2021-07-25 11:15:37 +02:00
Sandro La Bruzzo d9e3b89937 implemented last part of workflows to generate scholixGraph 2021-07-23 16:38:32 +02:00
Sandro La Bruzzo ca74e8dd02 create a separate wf for resolving relation 2021-07-23 11:40:06 +02:00
Sandro La Bruzzo 31d2d6d41e Scholexplorer: introduction of dedup openaire 2021-07-21 18:09:32 +02:00
Miriam Baglioni 774cdb190e changes to mirror the last dump of the graph with the ols data model. 2021-07-13 18:57:24 +02:00
Miriam Baglioni 52ce35d57b - 2021-07-13 18:08:46 +02:00
Miriam Baglioni 970b387b8d modification to allow dump of a single community 2021-07-13 18:08:10 +02:00
Miriam Baglioni c028feef4f workflow for the dump as sub workflows 2021-07-13 18:06:44 +02:00
Sandro La Bruzzo 09fccf8000 added workflow to serialize scholix and summary in json 2021-07-09 11:01:42 +02:00
Sandro La Bruzzo cd17e19044 implemented branch workflow to import datacite and crossref in scholexplorer 2021-07-08 21:20:19 +02:00
Sandro La Bruzzo 8a034e46e1 updated baseline workflow 2021-07-08 11:11:41 +02:00
Sandro La Bruzzo 8535506c22 added scholix generation 2021-07-06 17:18:06 +02:00
Sandro La Bruzzo c6fa8598e1 massive code refactor:
removed modules dhp-*-scholexplorer
2021-07-01 22:13:45 +02:00
Sandro La Bruzzo 84b834c893 added test dataset test for pangaea 2021-06-30 17:31:09 +02:00
Sandro La Bruzzo 1a6b398968 implemented Creation of Raw Graph and Resolution 2021-06-30 17:27:55 +02:00
Sandro La Bruzzo 623a0c4edb code Refactor, renaming packages 2021-06-30 11:09:30 +02:00
Sandro La Bruzzo f36f92287d implemented mapping from Crossref Event Data to Oaf 2021-06-29 10:21:23 +02:00
Sandro La Bruzzo 511ec14c63 implemented mapping from EBI and Scholix Resolved to OAF 2021-06-28 22:04:22 +02:00
Sandro La Bruzzo ad50415167 Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer 2021-06-24 17:20:50 +02:00
Sandro La Bruzzo 80e15cc455 implemented mapping from uniprot, pdb and ebi links 2021-06-24 17:20:00 +02:00
Claudio Atzori 50fc5a64a0 [raw_all] Aggregator graph creation merges claims (updates) with the corresponding entity 2021-06-23 11:49:42 +02:00
Sandro La Bruzzo cc0f2b11fb Implemented mapping from pubmed baseline to OAF 2021-06-16 14:56:24 +02:00
Claudio Atzori 2039bb9f5f orcid / orcid_pending cleaning backported from master branch 2021-06-14 09:40:50 +02:00
Claudio Atzori dd19c4ac5a Merge pull request 'import_new_mdstores' (#112) from import_new_mdstores into stable_ids
Reviewed-on: #112
2021-06-14 09:23:55 +02:00
Sandro La Bruzzo e57294ac99 implemented changes on PUBMed dataflow 2021-06-03 10:52:09 +02:00
Michele Artini e950750262 add nodes to import hdfs mdstores 2021-06-01 10:48:50 +02:00
Michele Artini e9f2b6037c patch of mdstore records 2021-05-31 11:36:26 +02:00
Michele Artini ad56a44fda save as gzipped sequence file 2021-05-28 14:45:39 +02:00
Michele Artini 4fa5671d16 first implementation of Hdfs Mdstores Importer 2021-05-27 16:22:07 +02:00
Sandro La Bruzzo 714b71bd21 updated pubmed 2021-05-04 14:54:12 +02:00
Alessia Bardi 9a20057615 fixed query for organisations' pids 2021-04-29 15:23:39 +02:00
Sandro La Bruzzo 7f8848ecdd added first implementation of Pangaea Mapping 2021-04-27 11:30:37 +02:00
miconis 0393cdce42 addition of alternative names in export queries 2021-04-20 12:45:21 +02:00
miconis cadd0a5de8 modification of the queries for openorgs: they now consider also pending orgs 2021-04-20 12:06:56 +02:00
miconis 11b22b2d23 bug fix in the query, it now exports only relations with non-hidden organizations 2021-04-08 11:51:47 +02:00
miconis 0857100fb8 implementation of the tests for the openorgs integration in the openaire provision 2021-04-07 18:42:16 +02:00
miconis bf685d849f addition of pids in the query for the export of openorgs for the provision, addition of ec_fields in the openorgs model 2021-04-07 14:27:43 +02:00
miconis eaaefb8b4c implementation of the procedure to reuse content of different dbs when creating the raw graph 2021-04-06 14:35:51 +02:00
miconis c39c82dfe9 modification of the jobs for the integration of openorgs in the provision, dedup records are no more created by merging but simply taking results of openorgs portal 2021-04-06 14:31:00 +02:00
Claudio Atzori 9237d55d7f [OpenOrgsWf] cleanup 2021-03-29 17:40:34 +02:00
Claudio Atzori 7f4e9479ec [OpenOrgsWf] graph construction wf: allow to skip the import openorgs node (importOpenorgs true|false) 2021-03-29 16:59:16 +02:00
miconis f446580e9f code refactoring (useless classes and wf removed), implementation of the test for the openorgs dedup 2021-03-29 16:10:46 +02:00
miconis 2355cc4e9b minor changes and bug fix 2021-03-29 10:07:12 +02:00
miconis 28c1cdd132 merged stable_ids into openorgswf 2021-03-25 10:44:49 +01:00
miconis 348b0ef921 bug fix, implementation of the workflow for the creation of raw_organizations (openorgs dedup), addition of the pid lists to the openorgs postgres db 2021-03-24 15:51:27 +01:00
Claudio Atzori 8d2bb24512 merged from master 2021-03-08 15:44:34 +01:00
miconis 4b2124a18e implementation of the openorgs wfs, implementation of the raw_all wf to migrate openorgs db entities 2021-02-10 11:51:50 +01:00
Michele Artini 991e675dc6 validation in claim rels 2020-12-14 15:41:25 +01:00
Claudio Atzori 12e2f930c8 resolved conflicts 2020-12-10 10:57:39 +01:00
Miriam Baglioni 5fb65ffc4a merge branch with master 2020-12-03 11:24:35 +01:00
Miriam Baglioni ea88dc3401 fixed issue in property name 2020-12-03 11:24:23 +01:00
Claudio Atzori 893ac4a77b GenerateEntitiesApplication can be configured to hash the id value or not 2020-12-02 09:30:06 +01:00
Miriam Baglioni 5fbe54ef54 #61 (comment) 2020-11-25 18:10:28 +01:00
Miriam Baglioni ed01e5a5e1 #61 (comment) 2020-11-25 18:09:34 +01:00
Miriam Baglioni f5e5e92a10 changed because of #61 (comment) 2020-11-25 17:58:53 +01:00
Claudio Atzori dfd6205b95 Consistency graph workflow merges all the entities by ID 2020-11-25 14:55:32 +01:00
Miriam Baglioni e7e418e444 added decision node to verify if to upload in Zenodo 2020-11-25 13:44:10 +01:00
Miriam Baglioni 39f4a20873 chenged the path and the name for saving the communities_infrastructures dump file 2020-11-24 14:47:32 +01:00
Miriam Baglioni 7e14452a87 final versione of the wf to get the dump of results associated to at least one funder per funder 2020-11-24 14:46:34 +01:00
Miriam Baglioni c167a18057 added new parameter for the dumpType 2020-11-24 14:45:50 +01:00
Claudio Atzori 33bae02451 reverted behaviour of the cleaning workflow: grouping entities by ID will be managed differently 2020-11-24 14:42:33 +01:00
Miriam Baglioni 0a9db67eec - 2020-11-20 12:21:33 +01:00
Miriam Baglioni cf3f47563f new parameter files 2020-11-19 19:16:05 +01:00
Miriam Baglioni 24c56fa7a3 new logic and workflow for dump of results with link to projects. In this implementation the result match the model of the communityresult. 2020-11-19 19:15:39 +01:00
Miriam Baglioni fafb688887 - 2020-11-18 18:56:48 +01:00
Miriam Baglioni 46ba3793f6 code, workflow and parameters for the dump of the results associated to funders 2020-11-18 16:47:31 +01:00
Miriam Baglioni 57cac36898 changed the workflow name 2020-11-18 13:38:03 +01:00
Claudio Atzori 6ab1ce53c9 fixed condition in result pid cleaning; cleanup 2020-11-16 10:09:17 +01:00
Claudio Atzori 4de8c8b237 fixed workflow variable name 2020-11-16 10:03:11 +01:00
Claudio Atzori 5d4e34e26a fixed typo in variable name 2020-11-14 10:32:26 +01:00
Claudio Atzori 528231a287 grouping graph entities by id turned out to be an easy extension for the already existing cleaning workflow 2020-11-13 15:37:48 +01:00
Claudio Atzori 2bed29eb09 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:12 +01:00
Claudio Atzori 13e36a4da0 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:02 +01:00
Michele Artini 40160d171f organizations pids 2020-11-09 12:58:36 +01:00
Claudio Atzori d10447e747 re-packaged graph dump workflow sources 2020-11-05 17:38:18 +01:00
Miriam Baglioni f8e9bda24c merge branch with master 2020-11-05 16:31:18 +01:00
Miriam Baglioni be5ed8f554 added check to avoid sending empty metadata. 2020-11-05 16:10:17 +01:00
Claudio Atzori 2148a51fae minor changes 2020-11-05 11:24:12 +01:00
Miriam Baglioni b90a945c49 removed property files for pid graph dump 2020-11-04 17:28:33 +01:00
Alessia Bardi 51808b5afd Updated descriptions 2020-11-04 12:29:48 +01:00
Alessia Bardi e6becf8659 Updated descriptions 2020-11-04 12:17:57 +01:00
Alessia Bardi 0abe0eee33 Updated descriptions 2020-11-04 12:15:30 +01:00
Alessia Bardi f6ab238f5d Updated descriptions 2020-11-04 11:50:47 +01:00
Miriam Baglioni c209284ca7 new schemas for the entities in the dump with added descriptions 2020-11-03 16:58:08 +01:00
Miriam Baglioni 08806deddf added the splitSize non mandatory parameter. Default size 10G 2020-11-03 16:57:34 +01:00
Miriam Baglioni 7d2eda43ca added new non mandatory property publish to determine if to publish the upload or leave it pending. Default value flase 2020-11-03 16:57:01 +01:00
Miriam Baglioni d4382b54df moved the tar archive with maz size on common module 2020-11-03 16:54:50 +01:00
Miriam Baglioni 78fdb11c3f merge branch with master 2020-10-29 12:55:22 +01:00
Sandro La Bruzzo 1d9fdb7367 fixed spark memory issue in SparkSplitOafTODLIEntities 2020-10-28 12:30:32 +01:00
Miriam Baglioni 3241ec1777 added connection timeout and socket timeout 600 sec 2020-10-27 16:12:11 +01:00
Claudio Atzori b961dc7d1e added originalid to the fields in the result graph view 2020-10-09 13:53:15 +02:00
Miriam Baglioni 11b7eaae09 changed the name of the folder where to store the context entity from context to communities_infrastructures 2020-10-05 11:24:54 +02:00
Claudio Atzori c2a6e2a9bf fixed mapping for datasource journal info (ISSNs) 2020-10-02 09:37:08 +02:00
Miriam Baglioni 01117a46e1 whole workflow activated 2020-10-01 17:19:21 +02:00
Miriam Baglioni fcaedac980 merge branch with master 2020-10-01 16:46:59 +02:00
Claudio Atzori 4287164aba include relevantdate field in the result view 2020-10-01 10:28:55 +02:00
Miriam Baglioni 983a12ed15 temporary modification to allow the upload of files in the sandbox without the neew to recreate the mapping from scratch 2020-09-25 16:41:51 +02:00
Miriam Baglioni 8b36d19182 added property depositionId and chenage property newVersion that became string from boolean to handle the three possible distinct values 2020-09-25 16:41:15 +02:00
Miriam Baglioni 54800fb9b0 enabled only the step to upload in zenodo 2020-09-25 14:40:22 +02:00
Miriam Baglioni de6c4d46d8 fixed conflicts 2020-09-24 15:35:01 +02:00
Claudio Atzori 044d3a0214 fixed query used to load datasources in the Graph 2020-09-24 13:48:58 +02:00
Claudio Atzori 42f55395c8 fixed order of the ISSNs returned by the SQL query 2020-09-24 12:09:58 +02:00
Claudio Atzori 9a7e72d528 using concat_ws to join textual columns from PSQL. When using || to perform the concatenation, Null columns makes the operation result to be Null 2020-09-24 10:42:47 +02:00
Miriam Baglioni e2ceefe9be - 2020-09-14 14:33:28 +02:00
Miriam Baglioni 40c8d2de7b test resources for the dump of the pids graph 2020-08-24 16:50:39 +02:00
Miriam Baglioni c5858afb88 added parameter to guide the dump for the result (resultAggregation). true if all the result types should be dump together, false otherwise. 2020-08-19 11:24:14 +02:00
Miriam Baglioni 5570678c65 changed parameter name from hfdsNameNode to nameNode 2020-08-19 10:59:26 +02:00
Miriam Baglioni 37e7c43652 changed parameter name from hdfsNaemNode to nameNode 2020-08-14 18:18:25 +02:00
Miriam Baglioni acb0926b2e json schemas for the dumped entities and relation 2020-08-11 15:39:48 +02:00
Miriam Baglioni ff52c51f92 added the communityMapPath parameter and removed the isLookUpUrl parameter 2020-08-11 15:39:22 +02:00
Miriam Baglioni 6f43acda5e added the maketar and send to zenodo step. Adjusted wf parameters 2020-08-11 15:38:20 +02:00
Miriam Baglioni ddc19de2e9 removed the isLookUpUrl among the parameters 2020-08-11 15:37:47 +02:00
Miriam Baglioni 592a8ea573 added parameter file for maketar class 2020-08-11 15:37:14 +02:00
Miriam Baglioni 77a0951b32 added the make archive step in the workflow 2020-08-11 15:32:32 +02:00
Miriam Baglioni fe88904df0 changed the wf definition 2020-08-10 12:01:14 +02:00
Miriam Baglioni 1cf7043e26 removed isLookUoUrl from the parameters 2020-08-10 11:38:03 +02:00
Miriam Baglioni 46986aae2d added the new parameter for newdeposion/newversion and concept_record_id 2020-08-07 18:00:06 +02:00
Miriam Baglioni da9b012c15 fixed dewcription 2020-08-06 11:55:44 +02:00
Miriam Baglioni 6dbadcf181 the new schema for the dumped result 2020-08-06 11:05:56 +02:00
Miriam Baglioni 5b651abf82 merge branch with master 2020-08-04 10:14:07 +02:00
Miriam Baglioni 901ae37f7b added step to workflow 2020-08-03 18:12:54 +02:00
Miriam Baglioni e43aeb139a added new property file and changed some parameter to old files 2020-08-03 18:07:28 +02:00
Miriam Baglioni c892c7dfa7 changed to query for community map just once and save the result for remaining executions 2020-08-03 17:56:31 +02:00
Miriam Baglioni 6f1c40a933 - 2020-07-30 16:24:28 +02:00
Miriam Baglioni 2b66a93f9e added property file that was missing 2020-07-30 16:24:17 +02:00
Michele Artini bdece15ca0 blacklist of nsprefix 2020-07-30 16:13:38 +02:00
Sandro La Bruzzo 3010a362bc updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset 2020-07-30 09:25:56 +02:00
Miriam Baglioni b48934f6df changed the workflow name 2020-07-29 17:43:43 +02:00
Miriam Baglioni 8ad8dac7d4 merge branch with fork master 2020-07-29 17:38:28 +02:00
Miriam Baglioni 40a8dafbdc - 2020-07-29 17:30:44 +02:00
Miriam Baglioni 8d4327b292 input parameters and workflow definition for the dump of the whole graph 2020-07-29 17:00:34 +02:00
Miriam Baglioni 178c2729a7 changed the path to reach the java class to be executed 2020-07-29 12:29:51 +02:00
Miriam Baglioni 437ac12139 removed unused parameter 2020-07-29 12:28:16 +02:00
Miriam Baglioni 332258d199 split the classes related to the communities dump and to the whole graph dump 2020-07-24 17:21:48 +02:00
Claudio Atzori 56bbfdc65d introduced parameter 'numParitions', driving the hive DB table data partitioning. Currently specified only for table 'project' 2020-07-23 08:54:10 +02:00
Miriam Baglioni 40bbe94f7c merge with master fork 2020-07-20 18:10:03 +02:00
Miriam Baglioni 23160b4d29 realignment of the workflow classes with the changes in the structure of the module 2020-07-20 18:04:30 +02:00
Claudio Atzori e0c4cf6f7b added parameter to drive the graph merge strategy: priority (BETA|PROD) 2020-07-20 10:48:01 +02:00
Claudio Atzori 94ccdb4852 Merge branch 'master' into merge_graph 2020-07-20 10:14:55 +02:00
Sandro La Bruzzo 9116d75b3e Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-17 18:01:30 +02:00
Claudio Atzori 878f2b931c Merge branch 'master' into merge_graph 2020-07-16 16:34:24 +02:00
Claudio Atzori cc77446dc4 added dbSchema parameter to the raw_db workflow 2020-07-10 19:01:50 +02:00
Sandro La Bruzzo c01efed79b Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-10 14:44:57 +02:00
Sandro La Bruzzo a7d3977481 added generation of EBI Dataset 2020-07-10 14:44:50 +02:00