Commit Graph

2172 Commits

Author SHA1 Message Date
Antonis Lempesis d23ccae0d5 ignoring deletedbyinference relations 2020-12-04 12:42:17 +02:00
Miriam Baglioni 5fb65ffc4a merge branch with master 2020-12-03 11:24:35 +01:00
Miriam Baglioni ea88dc3401 fixed issue in property name 2020-12-03 11:24:23 +01:00
Miriam Baglioni 4c58bd1c93 merge with upstream 2020-12-03 11:24:00 +01:00
Miriam Baglioni 05c452f58d merge with upstream 2020-12-03 10:26:45 +01:00
Antonis Lempesis 413afcfed5 finished first implementation of wf 2020-12-02 15:57:17 +02:00
Antonis Lempesis 0948536614 initial implementation of the promote wf 2020-12-02 15:41:56 +02:00
Sandro La Bruzzo 7da679542f fixed wrong projectId 2020-12-02 14:28:09 +01:00
Sandro La Bruzzo 6ba8037cc7 fixed failure to test due to changing of input 2020-12-02 11:34:46 +01:00
Claudio Atzori cfb55effd9 code formatting 2020-12-02 11:23:49 +01:00
Claudio Atzori 74242e450e using constants from ModelConstants 2020-12-02 11:23:35 +01:00
Miriam Baglioni d5efa6963a using constants in ModelConstants 2020-12-02 11:20:26 +01:00
Miriam Baglioni cd285e98bc using the constants defined in the ModelConstants class 2020-12-02 11:13:23 +01:00
Miriam Baglioni 4b0d1530a2 merge upstream 2020-12-02 11:05:00 +01:00
Claudio Atzori faa977df7e Merge pull request 'orcid-no-doi' (#43) from enrico.ottonello/dnet-hadoop:orcid-no-doi into master. The dataset was generated and is now part of the actionsets available in BETA 2020-12-02 10:55:12 +01:00
Claudio Atzori 57f448b7a4 graph cleaning workflow separate orcid_pending from orcid, depending on the author pid provenance 2020-12-02 10:44:05 +01:00
Alessia Bardi 2d15667b4a testing XML generation from json object (case AMS ACTA) 2020-12-02 10:16:26 +01:00
Alessia Bardi a417624670 tests for raw graph mapping 2020-12-02 10:15:26 +01:00
Claudio Atzori 893ac4a77b GenerateEntitiesApplication can be configured to hash the id value or not 2020-12-02 09:30:06 +01:00
Miriam Baglioni f8468c9c22 added extension for new author pid (orcid_pending) 2020-12-01 20:09:35 +01:00
Miriam Baglioni 888175baf7 added java doc 2020-12-01 18:36:29 +01:00
Miriam Baglioni 3d62d99d5d fixed issue in workflow variable 2020-12-01 15:02:49 +01:00
Miriam Baglioni 17680296b9 removed unnecessary variable and unused method 2020-12-01 15:02:31 +01:00
Miriam Baglioni 5b3ed70808 refactoring 2020-12-01 14:31:34 +01:00
Miriam Baglioni 62ff4999e3 added workflow and last step of collection and save 2020-12-01 14:30:56 +01:00
Miriam Baglioni 45d06c45c7 collecting all the atomic actions for result type and save them all in the AS path 2020-12-01 14:29:18 +01:00
Miriam Baglioni 0051ebede5 extending test 2020-12-01 12:43:03 +01:00
Miriam Baglioni 719da15f04 added test resources 2020-12-01 12:42:30 +01:00
Miriam Baglioni db36e11912 classes, test classes and resources for production of the actionset to include bipFinder score in results 2020-11-30 20:14:23 +01:00
Enrico Ottonello f2df3ead74 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-11-30 14:22:46 +01:00
Enrico Ottonello 40c4559e92 added datainfo on authors pid with "sysimport:crosswalk:entityregistry", 2020-11-30 14:19:22 +01:00
Claudio Atzori 2c407e775e GenerateEntitiesApplication can be configured to hash the id value or not 2020-11-30 12:00:38 +01:00
Antonis Lempesis 815d6b25d9 added last step to update cache 2020-11-30 00:48:10 +02:00
Claudio Atzori 758d27745d cleaning tab characters from text fields 2020-11-27 16:07:24 +01:00
Claudio Atzori e731a7658d cleaning texts to remove tab characters too 2020-11-27 09:00:04 +01:00
Claudio Atzori 5151850a19 CROSSREF and DATACITE constants moved in common ModelConstants 2020-11-26 13:08:36 +01:00
Claudio Atzori a104d2b6ad cleanup 2020-11-26 11:12:00 +01:00
Claudio Atzori d0d5525d40 minor changes 2020-11-26 11:04:17 +01:00
Claudio Atzori 13eae4b31e GroupEntitiesSparkJob must read all graph paths but relations 2020-11-26 11:04:01 +01:00
Claudio Atzori 76363a8512 SimpleDateFormat is not thread safe; improved error reporting in case of invalid dates 2020-11-26 11:03:12 +01:00
Claudio Atzori c1b9a4045a grouping of records will be performed by the dedup workflow 2020-11-26 10:59:10 +01:00
Miriam Baglioni 124591a7f3 refactoring 2020-11-25 18:23:28 +01:00
Miriam Baglioni 1a89f8211c D-Net/dnet-hadoop#61 (comment) 2020-11-25 18:12:40 +01:00
Miriam Baglioni 5fbe54ef54 D-Net/dnet-hadoop#61 (comment) 2020-11-25 18:10:28 +01:00
Miriam Baglioni ed01e5a5e1 D-Net/dnet-hadoop#61 (comment) 2020-11-25 18:09:34 +01:00
Miriam Baglioni d4ddde2ef2 changed because of D-Net/dnet-hadoop#61 (comment) 2020-11-25 18:01:01 +01:00
Miriam Baglioni f5e5e92a10 changed because of D-Net/dnet-hadoop#61 (comment) 2020-11-25 17:58:53 +01:00
Miriam Baglioni 1df94b85b4 changed because of D-Net/dnet-hadoop#61 (comment) 2020-11-25 17:57:43 +01:00
Claudio Atzori db0181b8af Merge pull request 'added bidirectionality to relations from project and result coming from crossref' (#60) from miriam.baglioni/dnet-hadoop:sxBidirectionality into master 2020-11-25 17:17:40 +01:00
Sandro La Bruzzo ec3e238de6 Fixed problem on duplicated identifier 2020-11-25 17:15:54 +01:00
Claudio Atzori e208b03755 renamed workflow 2020-11-25 14:55:50 +01:00
Claudio Atzori dfd6205b95 Consistency graph workflow merges all the entities by ID 2020-11-25 14:55:32 +01:00
Miriam Baglioni 90d4369fd2 added test to verify the compression in writing community info on hdfs 2020-11-25 14:34:58 +01:00
Miriam Baglioni 6750e33d69 merge branch with master 2020-11-25 14:09:01 +01:00
Miriam Baglioni b2c455f883 added java doc 2020-11-25 14:08:09 +01:00
Miriam Baglioni 1f130cdf92 changed the relation (produces -> isProducedBy) due to the change in the code 2020-11-25 14:04:26 +01:00
Miriam Baglioni e758d5d9b4 refactoring 2020-11-25 13:46:39 +01:00
Miriam Baglioni 87a9f616ae refactoring and addition of the funder nsp first part as name for the dump instead of the whole nsp 2020-11-25 13:45:41 +01:00
Miriam Baglioni e7e418e444 added decision node to verify whether to upload to Zenodo 2020-11-25 13:44:10 +01:00
Miriam Baglioni 305e3d0c9c added resource file for relation with relClass = isProducedBy 2020-11-25 13:43:41 +01:00
Miriam Baglioni 21ce175d17 added FilterFunction specification in filter operation 2020-11-25 13:42:31 +01:00
Miriam Baglioni bde6d337dd test classes for dump of results related to funders 2020-11-25 13:42:01 +01:00
Miriam Baglioni b37b9352d7 added constant value for semantic relationship between projects and results 2020-11-25 13:41:08 +01:00
Sandro La Bruzzo 264723ffd8 updated stuff for zenodo upload 2020-11-25 11:56:07 +01:00
Claudio Atzori 36173c13a5 reverted filters in the cleaning process 2020-11-25 10:24:42 +01:00
Claudio Atzori eeebd5a920 Cleaning workflow: remove newlines from titles, descriptions, subjects 2020-11-24 18:40:25 +01:00
Claudio Atzori e1a1bb3ee4 moved class CleaningFunctions in the correct package. Remove newlines from titles, descriptions, subjects 2020-11-24 18:34:03 +01:00
Enrico Ottonello 99a086f0c6 max concurrent executors set to 10, according to ORCID Director of Technology mail request 2020-11-24 17:49:32 +01:00
Miriam Baglioni 72bb0fe360 changed directory name 2020-11-24 16:47:07 +01:00
Miriam Baglioni 00874a8ce6 added bidirectionality to relations from project and result 2020-11-24 15:17:23 +01:00
Miriam Baglioni 39f4a20873 changed the path and the name for saving the communities_infrastructures dump file 2020-11-24 14:47:32 +01:00
Miriam Baglioni 7e14452a87 final version of the wf to get the dump of results associated to at least one funder per funder 2020-11-24 14:46:34 +01:00
Miriam Baglioni c167a18057 added new parameter for the dumpType 2020-11-24 14:45:50 +01:00
Miriam Baglioni 54a309bb6b refactoring 2020-11-24 14:45:30 +01:00
Miriam Baglioni 35ecea8842 changed to consider the modification for the specification of the type of dump 2020-11-24 14:45:15 +01:00
Miriam Baglioni b9b6bdb2e6 fixing issue on previous implementation 2020-11-24 14:44:53 +01:00
Miriam Baglioni 7e940f1991 changed to consider the modification for the specification of the type of dump 2020-11-24 14:43:34 +01:00
Miriam Baglioni 62928ef7a5 changed to save the communities_infrastructures information as the other entity dumps: in a json.gz file 2020-11-24 14:42:41 +01:00
Claudio Atzori 33bae02451 reverted behaviour of the cleaning workflow: grouping entities by ID will be managed differently 2020-11-24 14:42:33 +01:00
Miriam Baglioni 3319440c53 changed the direction of the relation between projects and result considered to select the results linked to projects 2020-11-24 14:41:09 +01:00
Miriam Baglioni 00c377dac2 added specification of MapFunction types in map 2020-11-24 14:40:22 +01:00
Miriam Baglioni 44db258dc4 added enumerated for the dump type 2020-11-24 14:38:06 +01:00
Miriam Baglioni 1832708c42 replaced boolean variable with a string one which specifies the type of dump we are performing: complete, community or funder 2020-11-24 14:37:36 +01:00
Enrico Ottonello 5c17e768b2 set wf configuration with spark.dynamicAllocation.maxExecutors 20 over 20 input partitions 2020-11-23 16:01:23 +01:00
Enrico Ottonello 5c9a727895 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-11-23 09:49:53 +01:00
Enrico Ottonello 97c8111847 action to convert lambda file in seq file; spark action to download updated authors 2020-11-23 09:49:22 +01:00
Miriam Baglioni 259c67ce36 fixed issue in path name 2020-11-20 12:32:23 +01:00
Miriam Baglioni 0a9db67eec - 2020-11-20 12:21:33 +01:00
Miriam Baglioni d362f2637d merge branch with master 2020-11-19 19:17:20 +01:00
Miriam Baglioni cf3f47563f new parameter files 2020-11-19 19:16:05 +01:00
Miriam Baglioni 24c56fa7a3 new logic and workflow for dump of results with link to projects. In this implementation the result matches the model of the communityresult. 2020-11-19 19:15:39 +01:00
Claudio Atzori d48f388fb2 Merge branch 'provision_indexing' 2020-11-19 15:59:55 +01:00
Claudio Atzori 46bde9c13f Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-19 15:26:27 +01:00
Claudio Atzori 7c9feaf9e7 project attributes removed from the XML record serialization: contactfullname, contactfax, contactphone, contactemail 2020-11-19 15:26:20 +01:00
Claudio Atzori fcbb05eb21 cleanup 2020-11-19 15:14:33 +01:00
Claudio Atzori 3f34757c63 merged from master 2020-11-19 14:34:54 +01:00
Michele Artini 293da47ad9 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-11-19 10:42:31 +01:00
Michele Artini ab08d12c46 considering abstract > MIN_LENGTH in ENRICH_MISSING_ABSTRACT 2020-11-19 10:42:10 +01:00
Claudio Atzori e503271abe fixed notification workflow name 2020-11-19 10:41:38 +01:00
Claudio Atzori 0374d34c3e introduced configuration param outputFormat: HDFS | SOLR 2020-11-19 10:34:28 +01:00
Miriam Baglioni fafb688887 - 2020-11-18 18:56:48 +01:00
Miriam Baglioni 906db690d2 - 2020-11-18 17:43:08 +01:00
Claudio Atzori ede7fae6c8 Merge pull request 'XML record indexing test' (#58) from provision_indexing into master 2020-11-18 17:04:34 +01:00
Miriam Baglioni 5402062ff5 changed parameter file with the one associated to the job 2020-11-18 16:58:20 +01:00
Miriam Baglioni a172a37ad1 fixed typo 2020-11-18 16:55:07 +01:00
Miriam Baglioni 46ba3793f6 code, workflow and parameters for the dump of the results associated to funders 2020-11-18 16:47:31 +01:00
Claudio Atzori 5218718e8b updated set of fields from the MDFormatDSResourceType on PROD 2020-11-18 15:00:41 +01:00
Claudio Atzori d9e07a242b extended XmlIndexingJob to accept an optional parameter: outputPath. When present, forces the job to write its output on the specified HDFS location 2020-11-18 14:34:55 +01:00
Claudio Atzori 29dcff0f34 spark complains about missing classes, so here they are again 2020-11-18 14:32:32 +01:00
Miriam Baglioni 57cac36898 changed the workflow name 2020-11-18 13:38:03 +01:00
Claudio Atzori 12acf25519 Merge pull request 'starting from first step...' (#57) from antonis.lempesis/dnet-hadoop:master into master. No judging. Just re-deploying... 2020-11-18 11:01:49 +01:00
Claudio Atzori 8177ce7939 test for XmlIndexingJob based on a local miniSolrCluster 2020-11-18 10:58:05 +01:00
Alessia Bardi 10e673660f Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-18 10:01:23 +01:00
Alessia Bardi be7b310cef rel semantics ignore case 2020-11-18 10:01:20 +01:00
Michele Artini 33da2e3d6c xpaths for dateOfCollection and dateOfTransformation 2020-11-18 09:26:20 +01:00
Antonis Lempesis 01a6e03989 starting from first step... 2020-11-17 23:26:47 +02:00
Alessia Bardi 8f87020a50 #56: map relevantDates from aggregated ODF records 2020-11-17 18:42:09 +01:00
Alessia Bardi 7e0a76a8ac test for TextGrid 2020-11-17 18:39:25 +01:00
Enrico Ottonello 2b0c9bbb7e Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-11-17 18:24:34 +01:00
Enrico Ottonello c0c2e05eae added wf to extract authors and works xml data from orcid dump to hdfs; added wf to download the lambda file (containing last orcid update information) from orcid to hdfs 2020-11-17 18:23:12 +01:00
Claudio Atzori cfc01f136e PID filtering based on a blacklist 2020-11-17 12:27:06 +01:00
Dimitris bbcf6b7c8b Commit 17112020 2020-11-17 08:36:51 +02:00
Enrico Ottonello c796adae24 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-11-16 11:57:19 +01:00
Claudio Atzori 6ab1ce53c9 fixed condition in result pid cleaning; cleanup 2020-11-16 10:09:17 +01:00
Claudio Atzori 4de8c8b237 fixed workflow variable name 2020-11-16 10:03:11 +01:00
Dimitris 3e24c9b176 Changes 14112020 2020-11-14 18:42:07 +02:00
Claudio Atzori 331d621800 added test resource 2020-11-14 12:16:15 +01:00
Claudio Atzori 5d4e34e26a fixed typo in variable name 2020-11-14 10:32:26 +01:00
Claudio Atzori 768bc5304c Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-13 15:40:34 +01:00
Claudio Atzori 93f7b7974f Merge pull request 'trust truncated to 3 decimals' (#24) from trunc_trust into master. LGTM 2020-11-13 15:40:02 +01:00
Claudio Atzori 528231a287 grouping graph entities by id turned out to be an easy extension for the already existing cleaning workflow 2020-11-13 15:37:48 +01:00
Enrico Ottonello 005f849674 added compression to output dataset 2020-11-13 12:45:31 +01:00
Enrico Ottonello 9a2fa9dc2f added test for other names parsing from summaries dump 2020-11-13 10:25:34 +01:00
Claudio Atzori 2bed29eb09 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:12 +01:00
Claudio Atzori 13e36a4da0 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:02 +01:00
Enrico Ottonello 13f28fa225 moved AuthorData to dhp-schemas; added other names to author data 2020-11-12 17:43:32 +01:00
Enrico Ottonello 2af21150c5 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-11-12 09:58:33 +01:00
Claudio Atzori 9b0fb9e958 merged from master 2020-11-12 09:27:12 +01:00
Claudio Atzori 75324ae58a Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-12 09:23:37 +01:00
Claudio Atzori 822971f54f no need to filter relations in CreateRelatedEntitiesJob_phase1; replaced 'left outer' join with 'left' join in CreateRelatedEntitiesJob_phase2; cleanup; 2020-11-12 09:22:59 +01:00
Enrico Ottonello 1f861f2b0d now wf output is a sequence file with the format seq("eu.dnetlib.dhp.schema.oaf.Publication",eu.dnetlib.dhp.schema.action.AtomicActions) 2020-11-11 17:38:50 +01:00
Claudio Atzori 9841488482 Merge pull request 'latest changes in stats wf' (#54) from antonis.lempesis/dnet-hadoop:master into master. LGTM, thanks! 2020-11-11 16:01:51 +01:00
Antonis Lempesis 99ebaee347 fixed #5913 2020-11-11 16:56:46 +02:00
Claudio Atzori e3d3481fb9 Merge pull request 'organizations pids' (#53) from organization_pids into master. LGTM 2020-11-11 14:08:25 +01:00
Antonis Lempesis f14e65f6a3 reverted wrong change 2020-11-10 17:23:04 +02:00
Antonis Lempesis c02c7741c9 fixes in db creation 2020-11-10 17:11:30 +02:00
Antonis Lempesis e603fa5847 fixes in db creation 2020-11-10 17:11:12 +02:00
Enrico Ottonello fea2451658 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-11-10 11:49:43 +01:00
Claudio Atzori 18d9aad70c improved documentation in dhp-graph-provision 2020-11-10 11:48:55 +01:00
Enrico Ottonello 1513174d7e added further test case 2020-11-10 11:44:55 +01:00
Michele Artini 40160d171f organizations pids 2020-11-09 12:58:36 +01:00
Sandro La Bruzzo 8e1d43aab2 Implemented ID generation using IdentifierRecordFactory on DOIBoost 2020-11-09 11:53:55 +01:00
Sandro La Bruzzo 027ef2326c Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-11-06 17:12:42 +01:00
Sandro La Bruzzo cd27df91a1 fixed bug on missing relation in ANDS 2020-11-06 17:12:31 +01:00
Enrico Ottonello 6bc7dbeca7 first version of dataset successfully generated from orcid dump 2020 2020-11-06 13:47:50 +01:00
Claudio Atzori d10447e747 re-packaged graph dump workflow sources 2020-11-05 17:38:18 +01:00
Claudio Atzori 2d76497488 cleanup 2020-11-05 17:10:24 +01:00
Miriam Baglioni f8e9bda24c merge branch with master 2020-11-05 16:31:18 +01:00
Miriam Baglioni be5ed8f554 added check to avoid sending empty metadata. 2020-11-05 16:10:17 +01:00
Claudio Atzori 2148a51fae minor changes 2020-11-05 11:24:12 +01:00
Claudio Atzori 4625b7486e code formatting 2020-11-04 18:12:43 +01:00
Claudio Atzori f5f346dd2b Merge pull request 'dump' (#50) from miriam.baglioni/dnet-hadoop:dump into master. LGTM 2020-11-04 18:07:01 +01:00
Miriam Baglioni e9ac471ae9 removed dependency from classes for the pid graph dump 2020-11-04 18:04:42 +01:00
Miriam Baglioni b90a945c49 removed property files for pid graph dump 2020-11-04 17:28:33 +01:00
Miriam Baglioni bac307155a removed properties specific for pid graph dump 2020-11-04 17:28:04 +01:00
Miriam Baglioni 9c9d50f486 removed code specific for pid graph dump 2020-11-04 17:26:22 +01:00
Miriam Baglioni 5669890934 removed commented lines 2020-11-04 17:15:21 +01:00
Miriam Baglioni 6a89f59be9 removed commented lines 2020-11-04 17:13:59 +01:00
Miriam Baglioni 56150d7e5e removed all code related to the dump of pids graph 2020-11-04 17:13:12 +01:00
Miriam Baglioni 16c54a96f8 removed pid dump 2020-11-04 17:11:32 +01:00
Claudio Atzori e5da4ee9b1 dedup workflow using the common PidComparator 2020-11-04 15:02:02 +01:00
Miriam Baglioni 0cac5436ff Merge branch 'dump' of code-repo.d4science.org:miriam.baglioni/dnet-hadoop into dump 2020-11-04 13:21:11 +01:00
Alessia Bardi 51808b5afd Updated descriptions 2020-11-04 12:29:48 +01:00
Alessia Bardi e6becf8659 Updated descriptions 2020-11-04 12:17:57 +01:00
Alessia Bardi 0abe0eee33 Updated descriptions 2020-11-04 12:15:30 +01:00
Alessia Bardi f6ab238f5d Updated descriptions 2020-11-04 11:50:47 +01:00
Sandro La Bruzzo 3581244daf Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-11-04 09:04:22 +01:00
Sandro La Bruzzo 66efb39634 implemented merge scholix 2020-11-04 09:04:01 +01:00
Miriam Baglioni c010a8442f fixed issue on test code 2020-11-03 17:26:51 +01:00
Miriam Baglioni 8ec7a61188 merge branch with master 2020-11-03 16:59:08 +01:00
Miriam Baglioni c209284ca7 new schemas for the entities in the dump with added descriptions 2020-11-03 16:58:08 +01:00
Miriam Baglioni 08806deddf added the splitSize non mandatory parameter. Default size 10G 2020-11-03 16:57:34 +01:00
Miriam Baglioni 7d2eda43ca added new non mandatory property publish to determine whether to publish the upload or leave it pending. Default value false 2020-11-03 16:57:01 +01:00
Miriam Baglioni cbbb1bdc54 moved business logic to new class in common for handling the zip of the archives 2020-11-03 16:55:50 +01:00
Miriam Baglioni d4382b54df moved the tar archive with max size on common module 2020-11-03 16:54:50 +01:00
Claudio Atzori 86d6fbe95b refactoring: CleaningFunctions and OafMapperUtils moved in dhp-common 2020-11-03 12:19:46 +01:00
Claudio Atzori 8471888ad3 Merge branch 'graph_cleaning' into stable_ids 2020-11-03 11:52:47 +01:00
Claudio Atzori 5310e56dba remove empty PIDs 2020-11-03 11:52:10 +01:00
Claudio Atzori 3fcd669e99 result merge operation leverage on custom ResultTypeComparator in the aggregator graph construction 2020-11-03 10:53:23 +01:00
Claudio Atzori 8e7f81c5f5 code formatting 2020-11-02 14:25:00 +01:00
Claudio Atzori 09e44dabff Merge branch 'master' into stable_ids 2020-11-02 12:16:01 +01:00
Sandro La Bruzzo 754c86f33e fixed test to work on jenkins 2020-11-02 09:35:01 +01:00
Sandro La Bruzzo 39337d8a8a fixed test 2020-11-02 09:26:25 +01:00
Dimitris 32bf943979 Changes to download only updates 2020-11-02 09:08:25 +02:00
Miriam Baglioni dabb33e018 changed the discriminant for which split the file 2020-10-30 17:52:22 +01:00
Claudio Atzori c5dda3a00c Merge pull request 'h2020classification' (#49) from miriam.baglioni/dnet-hadoop:h2020classification into master. LGTM 2020-10-30 17:10:05 +01:00
Miriam Baglioni 4905739be6 changed resource file to mirror change in business logic 2020-10-30 17:02:57 +01:00
Miriam Baglioni b40360ebfb changed the code to mirror the changed decision in the classification level and programme description labels 2020-10-30 17:02:30 +01:00
Miriam Baglioni 696409fb9f disabled tests because needing remote resource 2020-10-30 17:01:48 +01:00
Miriam Baglioni 0fba08eae4 max allowed size per file 10 Gb 2020-10-30 16:05:55 +01:00
Claudio Atzori 385214eeae code formatting 2020-10-30 15:47:05 +01:00
Claudio Atzori 04ad8969b2 anticipated execution of the graph cleaning workflow 2020-10-30 15:46:55 +01:00
Claudio Atzori 4ca75d6951 Merge pull request 'Dedup ID creation policy' (#48) from deduptesting into stable_ids 2020-10-30 15:15:32 +01:00
Miriam Baglioni b828587252 prevent the code from cycling indefinitely 2020-10-30 15:01:25 +01:00
Miriam Baglioni f747e303ac classes for dumping of the graph as ttl file 2020-10-30 14:13:45 +01:00
Miriam Baglioni 16baf5b69e formatting 2020-10-30 14:13:14 +01:00
Miriam Baglioni a9eef9c852 added check for possible Optional value in relation dataInfo 2020-10-30 14:12:28 +01:00
Miriam Baglioni 5f4de9a962 formatting 2020-10-30 14:11:40 +01:00
Miriam Baglioni 14bf2e7238 added option to split dumps bigger than 40Gb on different files 2020-10-30 14:09:04 +01:00
Dimitris b8a3392b59 Commit 30102020 2020-10-30 14:07:21 +02:00
Claudio Atzori 58f28296ea ProvisionConstants moved as ModelHardLimits in dhp-common and applied to truncate long abstracts (len > 150000). Further filtering for empty PID values 2020-10-30 10:56:42 +01:00
Miriam Baglioni 78fdb11c3f merge branch with master 2020-10-29 12:55:22 +01:00
Sandro La Bruzzo 1d9fdb7367 fixed spark memory issue in SparkSplitOafTODLIEntities 2020-10-28 12:30:32 +01:00
Miriam Baglioni d2374e3b9e added code to handle cases where the funding tree is not existing 2020-10-27 16:15:21 +01:00
Miriam Baglioni 5d3012eeb4 changed code to dump only the programme list and not the classification list 2020-10-27 16:14:18 +01:00
Miriam Baglioni 3241ec1777 added connection timeout and socket timeout 600 sec 2020-10-27 16:12:11 +01:00
Enrico Ottonello 9818e74a70 added dependency version in main pom.xml for orcid no doi 2020-10-22 16:38:00 +02:00
Enrico Ottonello 210a50e4f4 replaced null value 2020-10-22 16:24:42 +02:00
Enrico Ottonello b0290dbcb7 moved all dependencies version to main pom.xml 2020-10-22 16:20:46 +02:00
Enrico Ottonello a38ab57062 let run test methods 2020-10-22 15:43:50 +02:00
Enrico Ottonello 1139d6568d replaced null value with a more safe empty string as return value 2020-10-22 15:32:26 +02:00
Enrico Ottonello c58db1c8ea added filter on null value after map function 2020-10-22 15:11:02 +02:00
Enrico Ottonello 846ba30873 if typologies mapping fails, an exception will be propagated 2020-10-22 14:36:18 +02:00
Enrico Ottonello c3114ba0ae replaced null as return value with a more safe empty string 2020-10-22 14:21:31 +02:00
Enrico Ottonello c295c71ca0 added comment 2020-10-22 14:07:26 +02:00
Enrico Ottonello ab083f9946 propagate exception on parsing work (PR request) 2020-10-22 14:02:32 +02:00
sandro 3a81a940b7 solved bug on merge publication 2020-10-21 22:41:55 +02:00
Miriam Baglioni a2ce527fae changed to match the requirements for short titles in level and long titles in classification 2020-10-20 17:03:25 +02:00
Sandro La Bruzzo 346ed65e2c added upload to zenodo node 2020-10-20 16:59:55 +02:00
sandro 271b4db450 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-10-20 16:09:49 +02:00
sandro d58d02d448 added workflow upload on zenodo 2020-10-20 16:09:07 +02:00
miconis c4a59d1b9a merge with the master to port the new packages 2020-10-20 16:07:30 +02:00
miconis 708d887e64 minor changes 2020-10-20 15:12:19 +02:00
miconis 0e54803177 bug fix in the id generator and implementation of jobs for organization dedup 2020-10-20 12:19:46 +02:00
Alessia Bardi 1425d810a8 testing mapping 2020-10-19 17:46:14 +02:00
Claudio Atzori 266bf1a221 common IdentifierFactory in use on the mapping from the aggregator data; merge the entities sharing the same id; code formatting 2020-10-16 17:02:10 +02:00
Claudio Atzori 34f1d0904b common IdentifierFactory in use on the mapping from the aggregator data 2020-10-16 16:00:19 +02:00
Sandro La Bruzzo fed711da80 Merge remote-tracking branch 'origin/master' into merge_record_to_common 2020-10-13 15:32:45 +02:00
Sandro La Bruzzo 34bf64c94f fixed export Scholexplorer to OpenAire 2020-10-13 08:47:58 +02:00
Alessia Bardi 8775a64bc1 Merge pull request 'Merging different compatibility levels (pinocchio operator)' (#47) from merge_graph into master 2020-10-09 14:44:52 +02:00
Claudio Atzori e751c1402f Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-10-09 13:53:21 +02:00
Claudio Atzori b961dc7d1e added originalid to the fields in the result graph view 2020-10-09 13:53:15 +02:00
miconis 6f8720982c bug fix in the idgenerator and test implementation 2020-10-09 09:30:23 +02:00
Sandro La Bruzzo 734934e2eb fixed error on empty intersection with publication and relation on export to OAF 2020-10-08 17:29:29 +02:00
Sandro La Bruzzo eec418cd26 moved AuthoreMerger into dhp-common 2020-10-08 10:33:55 +02:00
Sandro La Bruzzo fe0a7870e6 Added test to check if merge authors works 2020-10-08 10:33:12 +02:00
Sandro La Bruzzo cd9c377d18 adapted scholexplorer Dump generation to the new Dataset definition 2020-10-08 10:10:13 +02:00
Claudio Atzori a3f37a9414 javadoc 2020-10-07 16:44:22 +02:00
Claudio Atzori 8d85a2fced [BETA wf only] datasources involved in the merge operation don't obey the infra precedence policy, but rely on a custom behaviour that, given two datasources from beta and prod, returns the one from prod with the highest compatibility among the two 2020-10-07 16:28:52 +02:00
Claudio Atzori 5f7b75f5c5 code formatting 2020-10-07 13:22:54 +02:00
miconis 1804c5d809 refactoring: classes moved in the right package 2020-10-06 16:44:51 +02:00
miconis 7093355487 bug fix and minor changes 2020-10-06 16:21:34 +02:00
miconis 5a8bc329c5 bug fix in the result merge: it takes the correct bestaccessright based on the license instead of the trust 2020-10-06 15:26:44 +02:00
miconis a2ac7e52fb implementation of the workflow for new organizations in openorgs 2020-10-06 13:58:09 +02:00
Miriam Baglioni 061527f06e adding short description 2020-10-05 13:54:39 +02:00
Miriam Baglioni 0c12d7bdd8 adding short description 2020-10-05 11:39:55 +02:00
Miriam Baglioni ae08b3c0dd merge branch with master 2020-10-05 11:35:55 +02:00
Miriam Baglioni 11b7eaae09 changed the name of the folder where to store the context entity from context to communities_infrastructures 2020-10-05 11:24:54 +02:00
Miriam Baglioni 32bffb0134 changed the name from communities_infrastructures to communities_infrastuctures.json 2020-10-05 11:24:17 +02:00
Claudio Atzori 23f64d9eb4 updated dedup tests following the dnet-pace-core library update 2020-10-02 14:30:53 +02:00
Miriam Baglioni fc2f7636be removed not used code 2020-10-02 12:33:52 +02:00
Miriam Baglioni 25cbcf6114 changed to solve issues about names. context renamed communities_infrastructure.json and removed the double json.gz extension from the name of the part in the tar 2020-10-02 12:17:46 +02:00
Claudio Atzori 9db0f88fb8 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-10-02 09:43:35 +02:00
Claudio Atzori 49ae3450a9 code formatting 2020-10-02 09:43:24 +02:00
Claudio Atzori c2a6e2a9bf fixed mapping for datasource journal info (ISSNs) 2020-10-02 09:37:08 +02:00
Miriam Baglioni 01117a46e1 whole workflow activated 2020-10-01 17:19:21 +02:00
Miriam Baglioni cfb5766c6b removed double json.gz from names of files in the tar 2020-10-01 17:18:34 +02:00
Miriam Baglioni fcaedac980 merge branch with master 2020-10-01 16:46:59 +02:00
Miriam Baglioni c6e6ed1bd8 merge branch with master 2020-10-01 16:24:41 +02:00
Miriam Baglioni 4aec347351 refactoring 2020-10-01 16:23:52 +02:00
Miriam Baglioni 61946b4092 refactoring 2020-10-01 16:22:48 +02:00
Miriam Baglioni 7e6d35e56c added the link to the excel file related to topic 2020-10-01 15:53:31 +02:00
Sandro La Bruzzo 1a0a44e85a Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-10-01 15:46:53 +02:00
Sandro La Bruzzo c4a3c52e45 fixed Doiboost bug in the identifier 2020-10-01 15:46:44 +02:00
Miriam Baglioni 43cbd62c2b added classpath.first in the configuration 2020-10-01 15:46:34 +02:00
Miriam Baglioni cd69c6b023 added dependency for the topic file path 2020-10-01 15:45:59 +02:00
Miriam Baglioni 771cde3d05 moved the library version to global pom 2020-10-01 15:43:47 +02:00
Miriam Baglioni 632351c0da modified test resources to mirror the changed in the code 2020-10-01 15:43:02 +02:00
Miriam Baglioni ebc1c5513f modified test resources to mirror the changed in the code 2020-10-01 15:42:29 +02:00
Miriam Baglioni 3a374c34b6 fixed null pointer exception 2020-10-01 15:41:01 +02:00
Miriam Baglioni 83ea746163 added check to the test 2020-10-01 15:40:28 +02:00
Claudio Atzori 2e9e13444d author pids made unique by value 2020-10-01 12:50:40 +02:00
Miriam Baglioni 6e5db85b32 - 2020-10-01 11:51:11 +02:00
Miriam Baglioni a46179f61c refactoring 2020-10-01 11:22:01 +02:00
Miriam Baglioni b90bee124b removing rows that are empty from those imported 2020-10-01 11:16:49 +02:00
Miriam Baglioni c107f193c9 refactoring 2020-10-01 11:16:22 +02:00
Claudio Atzori e265c3e125 cleaning functions factored out in a dedicated class 2020-10-01 10:50:15 +02:00
Miriam Baglioni 706a80a29a added test to check that separator '-' (not hyphen) will be recognized 2020-10-01 10:38:31 +02:00
Miriam Baglioni 3dca586b3b refactoring 2020-10-01 10:34:48 +02:00
Miriam Baglioni 416bda6066 changed the programme.description by using the same value used in the classification instead of the short title or the title 2020-10-01 10:31:33 +02:00
Miriam Baglioni f6587c91f3 added comparison to a char that looks like '-' but is not 2020-10-01 10:30:26 +02:00
Claudio Atzori 4287164aba include relevantdate field in the result view 2020-10-01 10:28:55 +02:00
miconis e3f7798d1b minor changes in dedup tests, bug fix in the idgenerator and pace-core version update 2020-09-29 15:31:46 +02:00
Miriam Baglioni 7e73bb88b3 changed the logic to add the topic description to the project 2020-09-28 17:21:43 +02:00
Miriam Baglioni 0a035e3630 - 2020-09-28 17:20:49 +02:00
Miriam Baglioni 16bee2084d added the topic code to the project subset 2020-09-28 17:20:11 +02:00
Miriam Baglioni 0bf2d0db52 added to the workflow the download of the topic excel file and one property needed to get the input path of the topic file in the hdfs filesystem 2020-09-28 12:17:22 +02:00
Miriam Baglioni c2abde4d9f changed the implementation of Atomic Actions creation by exploiting the topic information taken from the cordis excel file 2020-09-28 12:16:34 +02:00
Miriam Baglioni d930b8d3fc changed the query to get only the code of the project and not the optional1 (topic code) and optional2 (topic description) 2020-09-28 12:15:48 +02:00
Miriam Baglioni f8f5cfd5cc removed the part added to set the topic code and description in the step of project preparation 2020-09-28 12:13:33 +02:00
Miriam Baglioni 9e19c9a221 remove the topic description from the values in the CSVProject class 2020-09-28 12:11:03 +02:00
Miriam Baglioni 6d8b932e40 refactoring 2020-09-28 12:06:56 +02:00
Miriam Baglioni b77f166549 changed the package name from csvutils to utils 2020-09-28 12:05:47 +02:00
Miriam Baglioni e33e3277de added needed dependency to read the excel file 2020-09-28 12:03:14 +02:00
Miriam Baglioni f4739a371a code to get the information related to the topic association between code and description. 2020-09-28 12:02:48 +02:00
Miriam Baglioni 7b6a7333e6 merge branch with master 2020-09-25 16:42:07 +02:00
Miriam Baglioni 983a12ed15 temporary modification to allow the upload of files in the sandbox without the need to recreate the mapping from scratch 2020-09-25 16:41:51 +02:00
Miriam Baglioni 8b36d19182 added property depositionId and changed property newVersion from boolean to string to handle the three possible distinct values 2020-09-25 16:41:15 +02:00
Miriam Baglioni ed5239f9ec added new code to handle the new possibility to upload files to an already open deposition 2020-09-25 16:34:32 +02:00
Miriam Baglioni 3a8c524fce refactor 2020-09-25 16:34:02 +02:00
Miriam Baglioni 2ac2b537b6 merge branch with master 2020-09-25 14:40:47 +02:00
Miriam Baglioni 54800fb9b0 enabled only the step to upload in zenodo 2020-09-25 14:40:22 +02:00
Miriam Baglioni 12c2dfc268 modified the resource to consider the information added to the model 2020-09-25 14:17:23 +02:00
Miriam Baglioni 969fa8d96e fixed issue and changed the transformation of the programme file to consider the new model 2020-09-25 13:32:34 +02:00
miconis 4cf79f32eb implementation of the oozie wf to prepare the openorgs input: relations between organizations 2020-09-25 11:29:51 +02:00
Michele Artini c171fdebe1 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-09-25 09:03:09 +02:00
Michele Artini c96598aaa4 opendoar partition 2020-09-25 09:02:58 +02:00
Miriam Baglioni de6c4d46d8 fixed conflicts 2020-09-24 15:35:01 +02:00
Miriam Baglioni e917281822 - 2020-09-24 15:24:05 +02:00
Miriam Baglioni 9f54f69e6d added topic information 2020-09-24 15:23:35 +02:00
Miriam Baglioni d6206d6e63 add the topic description to the action set associated to the project 2020-09-24 15:22:40 +02:00
Miriam Baglioni 6b50226f3b added topic code and topic description 2020-09-24 15:21:49 +02:00
Miriam Baglioni 15af1f527e modified to consider the topic information 2020-09-24 15:20:56 +02:00
Miriam Baglioni 609ff17cfc now the commission gives us the framework programme (FP7 - H2020) so use this information to filter out programmes not associated to H2020 2020-09-24 15:19:31 +02:00
Miriam Baglioni b66f930466 Added optional1 and optional2 information to the files read from the db. Optional1 contains the topic code and optional2 contains the topic description 2020-09-24 15:16:56 +02:00
Miriam Baglioni 860e6d38a6 added topic description to the CSV project variables 2020-09-24 15:15:26 +02:00
Claudio Atzori 044d3a0214 fixed query used to load datasources in the Graph 2020-09-24 13:48:58 +02:00
Claudio Atzori 27df1cea6d code formatting 2020-09-24 12:16:00 +02:00
Claudio Atzori fb22f4d70b included values for projects fundedamount and totalcost fields in the mapping tests. Swapped expected and actual values in junit test assertions 2020-09-24 12:10:59 +02:00
Claudio Atzori 42f55395c8 fixed order of the ISSNs returned by the SQL query 2020-09-24 12:09:58 +02:00
Claudio Atzori fadf5c7c69 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-09-24 10:42:52 +02:00
Claudio Atzori 9a7e72d528 using concat_ws to join textual columns from PSQL. When using || to perform the concatenation, a Null column makes the result of the operation Null 2020-09-24 10:42:47 +02:00
Claudio Atzori 9e3e93c6b6 setting the correct issn type in the datasource.journal element 2020-09-24 10:39:16 +02:00
Miriam Baglioni 0d83f47166 merge branch with master 2020-09-23 17:33:49 +02:00
Miriam Baglioni 39eb8ab25b changed the dump to move from h2020programme to h2020classification 2020-09-23 17:33:00 +02:00
Miriam Baglioni 1d84cf19a6 added new line to resource file 2020-09-23 17:32:22 +02:00
Miriam Baglioni f0c476b6c9 modification to the test classes to consider h2020classification 2020-09-23 17:31:49 +02:00
Miriam Baglioni 2cba3cb484 modification to the classes building the actionset to consider the h2020classification 2020-09-23 17:31:15 +02:00
Miriam Baglioni 1069cf243a modification to the schema to consider the H2020classification of the programme. The field Programme has been moved inside the H2020classification that is now associated to the Project. Programme is no longer associated directly to the Project but via H2020classification 2020-09-22 14:38:00 +02:00
Enrico Ottonello a97ad20c7b exception is now propagated (PR review) 2020-09-22 10:46:34 +02:00
Enrico Ottonello fefbcfb106 dependency version moved to main pom (PR review) 2020-09-22 10:20:25 +02:00
miconis 259362ef47 implementation of the job to collect simrels from postgres db 2020-09-22 09:43:27 +02:00
Michele Artini 9e681609fd stats to sql file 2020-09-17 15:51:22 +02:00
Michele Artini 51321c2701 partition of events by opendoarId 2020-09-17 11:38:07 +02:00
Claudio Atzori cf2ce1a09b code formatting 2020-09-15 15:58:03 +02:00
Enrico Ottonello 9e8e7fe6ef add comments 2020-09-15 11:32:49 +02:00
Miriam Baglioni c2b5c780ff - 2020-09-14 14:34:03 +02:00
Miriam Baglioni e2ceefe9be - 2020-09-14 14:33:28 +02:00
Miriam Baglioni 1f893e63dc - 2020-09-14 14:33:10 +02:00
Enrico Ottonello 538f299767 merged 2020-09-14 12:35:16 +02:00