Commit Graph

165 Commits

Author SHA1 Message Date
Sandro La Bruzzo 7387416e90 added params skip update to direct transform in OAF, this should be set to true in production 2021-10-12 12:36:30 +02:00
Sandro La Bruzzo 511da98d0c fixed bug on download pmc Article; removed unused line of code in SparkCreateActionset 2021-10-12 11:47:49 +02:00
Sandro La Bruzzo 5606014b17 code refactor see ticket #7065 2021-10-12 08:11:53 +02:00
Miriam Baglioni 5ec69889db OpenCitations: creation of AS from OC 2021-09-27 16:02:06 +02:00
Miriam Baglioni f2118d771a first steps in the implementation of the integration of opencitations 2021-09-22 15:18:05 +02:00
Sandro La Bruzzo 9f8a80deb7 fixed wrong import of unresolved relation in openaire 2021-09-01 14:16:27 +02:00
Miriam Baglioni ab8abd61bb GetCSV refactoring - refactoring due to movement of classes 2021-08-12 18:11:07 +02:00
Miriam Baglioni 1d6ac3715b merge branch with beta 2021-07-30 11:58:29 +02:00
Sandro La Bruzzo b1b0cc3f15 fixed wrong package name 2021-07-29 13:55:08 +02:00
Sandro La Bruzzo 3721df7aa6 refactoring create actionset of scholexplorer, moved on package dhp-aggregation 2021-07-29 10:45:35 +02:00
Miriam Baglioni 708d0ade34 Merge branch 'beta' into hostedbymap 2021-07-28 10:37:22 +02:00
Sandro La Bruzzo 825d9f0289 fixed datacite workflow starting from Importing delta 2021-07-27 16:09:46 +02:00
Miriam Baglioni 63553a76b3 added code to download gold issn list from unibi 2021-07-22 12:01:48 +02:00
Sandro La Bruzzo cd17e19044 implemented branch workflow to import datacite and crossref in scholexplorer 2021-07-08 21:20:19 +02:00
Sandro La Bruzzo 0cdb7ccdaa added inverse relations to datacite mapping 2021-06-04 15:10:20 +02:00
Sandro La Bruzzo 02ef46535f Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-05-31 09:50:15 +02:00
Sandro La Bruzzo aeadc5a366 updated wf Datacite Import to retrieve the block size as parameter 2021-05-31 09:49:53 +02:00
Claudio Atzori d512062b58 integrating pull #109, H2020Classification 2021-05-27 12:22:47 +02:00
Sandro La Bruzzo bced804151 updated wf Datacite Import to retrieve the block size as parameter 2021-05-26 17:06:50 +02:00
Miriam Baglioni c844877de2 changed workflow flow to possibly parallelize also the programme and project preparation steps 2021-05-21 14:41:57 +02:00
Miriam Baglioni 54f6e2f693 changed to get the needed information to build the action set as parallel jobs 2021-05-21 11:47:00 +02:00
Miriam Baglioni 9610224671 added param to workflow property 2021-05-20 18:21:12 +02:00
Claudio Atzori b695932ae4 integrated pull#108 2021-05-20 15:34:04 +02:00
Miriam Baglioni dc0ad8d2e0 fixed issue related to change in the file name downloaded. Added sheet name as parameter and also a check if the name should change 2021-05-20 14:53:53 +02:00
Claudio Atzori 239d0f0a9a ROR actionset import workflow backported from branch stable_ids 2021-05-18 16:12:11 +02:00
Michele Artini a278d67175 parse input file 2021-04-29 11:34:47 +02:00
Michele Artini b5cf505cc6 partial implementation of the ROR->actionset workflow 2021-04-28 16:00:24 +02:00
Sandro La Bruzzo fd29307b84 updated workflow name 2021-04-21 09:21:41 +02:00
Sandro La Bruzzo e06c7f32f6 updated id figshare as described in #6377 2021-04-20 10:18:07 +02:00
Sandro La Bruzzo cdfe01bbae improved parallelization on transformation job 2021-04-19 15:14:52 +02:00
Sandro La Bruzzo 616d2ecce2 split workflow collecting datacite into two workflows. Released on beta 2021-03-31 15:45:58 +02:00
Sandro La Bruzzo 1dfda3624e improved workflow importing datacite 2021-03-26 13:56:29 +01:00
Claudio Atzori 58467aaf1e WIP: transformation workflow error reporting 2021-02-17 16:14:41 +01:00
Claudio Atzori 1abe6d1ad7 WIP: collectorWorker error reporting, added report messages 2021-02-15 15:08:59 +01:00
Claudio Atzori 29c6f7e255 classes related to the collection workflow moved into common package; implemented MongoDB collection plugins 2021-02-12 12:31:02 +01:00
Claudio Atzori bae029f828 collection_java_xmx allows declaring the heap size allocated for the java actions involved in the metadata collection workflow 2021-02-08 18:07:23 +01:00
Claudio Atzori 50add4c61b added requestDelay to HttpConnector2 configuration; Aggregation workflow constants moved in dhp-common 2021-02-08 12:19:38 +01:00
Claudio Atzori a8a758925e better logging, WIP: collectorWorker error reporting 2021-02-05 19:18:05 +01:00
Sandro La Bruzzo 4dae5e605d implemented messaging between collection worker and Dnet 2021-02-04 15:51:15 +01:00
Claudio Atzori e04045089f better logging, WIP: collectorWorker error reporting 2021-02-03 17:58:22 +01:00
Claudio Atzori 53884d12c2 code formatting 2021-02-02 14:38:03 +01:00
Sandro La Bruzzo 0634674add implemented transformation test 2021-02-02 12:12:14 +01:00
Sandro La Bruzzo 6ff234d81b Implemented a first prototype of incremental harvesting and transformation using readlock 2021-02-01 13:56:05 +01:00
Sandro La Bruzzo e423634cb6 RollBack in case of error WORKS!!! 2021-01-29 17:21:42 +01:00
Sandro La Bruzzo 0276180039 WIP mdstore; transaction implemented on hadoop side 2021-01-29 16:42:41 +01:00
Sandro La Bruzzo 0f8e2ecce6 Merged Datacite transform into this branch 2021-01-29 10:45:07 +01:00
Sandro La Bruzzo 99cf3a8ea4 Merged Datacite transform into this branch 2021-01-28 16:34:46 +01:00
Sandro La Bruzzo 98b9498b57 Removed old messaging system not quite used from collection and Transformation workflow; code refactor 2021-01-28 09:51:17 +01:00
Sandro La Bruzzo 184e7b3856 Implemented new Transformation using spark 2021-01-27 15:43:08 +01:00
Claudio Atzori 41500669e2 [BIP! Scores integration] merged missing classes from bipFinder branch 2021-01-11 14:39:47 +01:00
Claudio Atzori 03319d3bd9 Revert "Merge pull request 'Creation of the action set to include the bipFinder! score' (#62) from miriam.baglioni/dnet-hadoop:bipFinder into master". This reverts commit add7e1693b, reversing changes made to f9a8fd8bbd. 2020-12-17 12:23:58 +01:00
Miriam Baglioni 3d62d99d5d fixed issue in workflow variable 2020-12-01 15:02:49 +01:00
Miriam Baglioni 62ff4999e3 added workflow and last step of collection and save 2020-12-01 14:30:56 +01:00
Miriam Baglioni 45d06c45c7 collecting all the atomic actions for result type and save them all in the AS path 2020-12-01 14:29:18 +01:00
Miriam Baglioni db36e11912 classes, test classes and resources for production of the actionset to include bipFinder score in results 2020-11-30 20:14:23 +01:00
Miriam Baglioni 43cbd62c2b added classpath.first in the configuration 2020-10-01 15:46:34 +02:00
Miriam Baglioni cd69c6b023 added dependency for the topic file path 2020-10-01 15:45:59 +02:00
Miriam Baglioni 0bf2d0db52 added to the workflow the download of the topic excel file and one property needed to get the input path of the topic file in the hdfs filesystem 2020-09-28 12:17:22 +02:00
Miriam Baglioni 782984d8e5 added needed parameter 2020-05-28 23:52:41 +02:00
Miriam Baglioni 773735f870 added the path to the file containing the projects code from the db 2020-05-28 17:30:45 +02:00
Miriam Baglioni 6a15067a64 added one step in the workflow 2020-05-28 17:30:09 +02:00
Miriam Baglioni b737ed8236 added part to read projects from the openaire db to filter out those in the csv file that are not in the db 2020-05-28 17:29:21 +02:00
Miriam Baglioni 1060977272 added fs actions to remove and then create the workingDir 2020-05-28 10:04:36 +02:00
Miriam Baglioni 1855453434 changed the outputdir of the last step 2020-05-27 17:59:36 +02:00
Miriam Baglioni 4589c428b1 generates action sets and saves them in the hdfs path for the action sets 2020-05-21 16:30:39 +02:00
Miriam Baglioni eb0e47ba53 parameters for h2020 programme 2020-05-20 10:26:44 +02:00
Miriam Baglioni 08218d2f3f new workflow with added steps 2020-05-19 18:44:25 +02:00
Miriam Baglioni 9447d78ef3 added preparation classes 2020-05-19 18:42:50 +02:00
Miriam Baglioni f0f14caf99 removed script files for shell actions not performed 2020-05-18 13:06:16 +02:00
Miriam Baglioni 23bbac7d7c - 2020-05-18 13:05:03 +02:00
Miriam Baglioni abc45f2708 added dnet-45 HttpConnector and related Classes, produced the POJO for projects and programme 2020-05-18 13:04:06 +02:00
Miriam Baglioni 5a648016ef parameters from the GetFile class 2020-05-15 18:18:50 +02:00
Miriam Baglioni 83c262a483 workflow to download the files 2020-05-15 18:18:31 +02:00
Miriam Baglioni 22cb9e0da7 simple code to get file from URL 2020-05-15 18:18:01 +02:00
Claudio Atzori 0825321d0b improved unit tests in dhp-aggregation 2020-05-05 12:39:04 +02:00
Claudio Atzori 9147af7fed actionsets migration workflow moved in dhp-workflows/dhp-actionmanager 2020-04-20 15:24:33 +02:00
Claudio Atzori 6b5f9ca9cb raw graph creation workflow moved under dhp-graph-mapper, claims integration is included 2020-04-10 17:53:07 +02:00
Michele Artini fd57722c69 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-25 15:56:49 +01:00
Michele Artini 0fda2c3a30 some tests on db records 2020-03-25 09:43:58 +01:00
Michele Artini e3760c7f39 fix a bug with organization countries 2020-03-24 08:43:56 +01:00
Claudio Atzori 36236dd1c1 action migration workflow produces eu.dnetlib.dhp.schema.action.AtomicAction(s) 2020-03-19 14:00:38 +01:00
Claudio Atzori abe8fb69a2 added global properties, moved postprocessing script inside the oozie_app directory 2020-03-18 15:43:54 +01:00
Claudio Atzori 2f11e37602 fixed expansion of path variables 2020-03-17 19:41:07 +01:00
Claudio Atzori 2795b0b096 no need to mkdir the all_entities file 2020-03-17 17:22:14 +01:00
Claudio Atzori 19746ad308 when reuseContent, reset ${workingPath}/all_entities 2020-03-17 17:17:06 +01:00
Claudio Atzori 2f0c85eeb3 updated parameters for regular_all_steps workflow, introduced flag 'reuseContent' 2020-03-17 17:04:58 +01:00
Claudio Atzori b8290b5851 updated parameters for regular_all_steps workflow 2020-03-17 15:45:30 +01:00
Claudio Atzori 4706f24ec5 updated parameters for regular_all_steps workflow 2020-03-17 15:23:54 +01:00
Claudio Atzori 9c84e21b87 added workflow to migrate latest version of each actionset content from DM to OCEAN cluster, mapping the targetValues from the old protobuf data model to the dhp.OAF datamodel 2020-03-13 15:56:52 +01:00
Michele Artini b6efa9d6ab Configuration of the SequenceFile Writer 2020-03-05 15:49:14 +01:00
Michele Artini 4b29a121b0 migration using spark in step2 2020-03-02 16:12:14 +01:00
Michele Artini 5445a57102 migration using spark in step2 2020-03-02 16:11:59 +01:00
Michele Artini 93665773ea Fixed a problem with JavaRDD Union 2020-02-25 15:59:21 +01:00
Michele Artini 5d3739b5cf migration of claims 2020-02-19 15:11:17 +01:00
Michele Artini 173f1df1e5 saved a query for openaire production database 2020-02-19 10:15:08 +01:00
Sandro La Bruzzo 9a2d74ac82 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-02-19 10:13:45 +01:00
Sandro La Bruzzo e5d7cdf422 fixed sql query 2020-02-19 10:13:36 +01:00
Claudio Atzori 6a288625e5 fixed workflow outgoing node 2020-02-17 15:04:33 +01:00
Sandro La Bruzzo 76ee85141a added oozie job for DNET migration and implemented Spark job for extracting entities 2020-02-17 12:31:44 +01:00
Michele Artini cdea0dae75 bug fixing 2020-02-12 16:34:00 +01:00