Commit Graph

1999 Commits

Author SHA1 Message Date
Claudio Atzori faa977df7e Merge pull request 'orcid-no-doi' (#43) from enrico.ottonello/dnet-hadoop:orcid-no-doi into master
The dataset was generated and is now part of the actionsets available in BETA
2020-12-02 10:55:12 +01:00
Claudio Atzori 57f448b7a4 graph cleaning workflow separate orcid_pending from orcid, depending on the author pid provenance 2020-12-02 10:44:05 +01:00
Alessia Bardi 2d15667b4a testing XML generation from json object (case AMS ACTA) 2020-12-02 10:16:26 +01:00
Alessia Bardi a417624670 tests for raw graph mapping 2020-12-02 10:15:26 +01:00
Enrico Ottonello f2df3ead74 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-11-30 14:22:46 +01:00
Enrico Ottonello 40c4559e92 added datainfo on authors pid with "sysimport:crosswalk:entityregistry", 2020-11-30 14:19:22 +01:00
Claudio Atzori e731a7658d cleaning texts to remove tab characters too 2020-11-27 09:00:04 +01:00
Claudio Atzori a104d2b6ad cleanup 2020-11-26 11:12:00 +01:00
Claudio Atzori db0181b8af Merge pull request 'added bidirectionality to relations from project and result coming from crossref' (#60) from miriam.baglioni/dnet-hadoop:sxBidirectionality into master 2020-11-25 17:17:40 +01:00
Sandro La Bruzzo ec3e238de6 Fixed problem on duplicated identifier 2020-11-25 17:15:54 +01:00
Sandro La Bruzzo 264723ffd8 updated stuff for zenodo upload 2020-11-25 11:56:07 +01:00
Claudio Atzori eeebd5a920 Cleanig workflow: remove newlines from titles, descriptions, subjects 2020-11-24 18:40:25 +01:00
Enrico Ottonello 99a086f0c6 max concurrent executors set to 10, according to ORCID Director of Technology mail request 2020-11-24 17:49:32 +01:00
Miriam Baglioni 00874a8ce6 added bidirectionality to relations from project and result 2020-11-24 15:17:23 +01:00
Enrico Ottonello 5c17e768b2 set wf configuration with spark.dynamicAllocation.maxExecutors 20 over 20 input partitions 2020-11-23 16:01:23 +01:00
Enrico Ottonello 5c9a727895 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-11-23 09:49:53 +01:00
Enrico Ottonello 97c8111847 action to convert lambda file in seq file; spark action to download updated authors 2020-11-23 09:49:22 +01:00
Claudio Atzori d48f388fb2 Merge branch 'provision_indexing' 2020-11-19 15:59:55 +01:00
Claudio Atzori 46bde9c13f Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-19 15:26:27 +01:00
Claudio Atzori 7c9feaf9e7 project attributes removed from the XML record serialization: contactfullname, contactfax, contactphone, contactemail 2020-11-19 15:26:20 +01:00
Michele Artini 293da47ad9 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-11-19 10:42:31 +01:00
Michele Artini ab08d12c46 considering abstract > MIN_LENGTH in ENRICH_MISSING_ABSTRACT 2020-11-19 10:42:10 +01:00
Claudio Atzori e503271abe fixed notification workflow name 2020-11-19 10:41:38 +01:00
Claudio Atzori 0374d34c3e introduced configuration param outputFormat: HDFS | SOLR 2020-11-19 10:34:28 +01:00
Claudio Atzori ede7fae6c8 Merge pull request 'XML record indexing test' (#58) from provision_indexing into master 2020-11-18 17:04:34 +01:00
Claudio Atzori 5218718e8b updated set of fields from the MDFormatDSResourceType on PROD 2020-11-18 15:00:41 +01:00
Claudio Atzori d9e07a242b extended XmlIndexingJob to accept an optional parameter: outputPath. When present, forces the job to write its output on the specified HDFS location 2020-11-18 14:34:55 +01:00
Claudio Atzori 29dcff0f34 spark complains about missing classes, so here they are again 2020-11-18 14:32:32 +01:00
Claudio Atzori 12acf25519 Merge pull request 'starting from first step...' (#57) from antonis.lempesis/dnet-hadoop:master into master
No judging. Just re-deploying...
2020-11-18 11:01:49 +01:00
Claudio Atzori 8177ce7939 test for XmlIndexingJob based on a local miniSolrCluster 2020-11-18 10:58:05 +01:00
Alessia Bardi 10e673660f Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-18 10:01:23 +01:00
Alessia Bardi be7b310cef rel semantcis ignore case 2020-11-18 10:01:20 +01:00
Michele Artini 33da2e3d6c xpaths for dateOfCollection and dateOfTransformation 2020-11-18 09:26:20 +01:00
Antonis Lempesis 01a6e03989 starting from first step... 2020-11-17 23:26:47 +02:00
Alessia Bardi 8f87020a50 #56: map relevantDates from aggregated ODF records 2020-11-17 18:42:09 +01:00
Alessia Bardi 7e0a76a8ac test fr TextGrid 2020-11-17 18:39:25 +01:00
Enrico Ottonello 2b0c9bbb7e Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-11-17 18:24:34 +01:00
Enrico Ottonello c0c2e05eae added wf to extracting authors and works xml data from orcid dump to hdfs; added wf to download the lamda file (containing last orcid update informations) from orcid to hdfs 2020-11-17 18:23:12 +01:00
Claudio Atzori cfc01f136e PID filtering based on a blacklist 2020-11-17 12:27:06 +01:00
Claudio Atzori 628ca54dd3 disable old maven repository URLs 2020-11-17 12:26:16 +01:00
Enrico Ottonello c796adae24 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-11-16 11:57:19 +01:00
Claudio Atzori 6ab1ce53c9 fixed condition in result pid cleaning; cleanup 2020-11-16 10:09:17 +01:00
Claudio Atzori 4de8c8b237 fixed workflow variable name 2020-11-16 10:03:11 +01:00
Claudio Atzori 331d621800 added test resource 2020-11-14 12:16:15 +01:00
Claudio Atzori 5d4e34e26a fixed typo in variable name 2020-11-14 10:32:26 +01:00
Claudio Atzori 768bc5304c Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-13 15:40:34 +01:00
Claudio Atzori 93f7b7974f Merge pull request 'trust truncated to 3 decimals' (#24) from trunc_trust into master
LGTM
2020-11-13 15:40:02 +01:00
Claudio Atzori 2facfefc19 updated maven repository URL 2020-11-13 15:38:40 +01:00
Claudio Atzori 528231a287 grouping graph entities by id turned out to be an easy extension for the already existing cleaning workflow 2020-11-13 15:37:48 +01:00
Enrico Ottonello 005f849674 added compression to output dataset 2020-11-13 12:45:31 +01:00