Commit Graph

1049 Commits

Author SHA1 Message Date
Alessia Bardi 27af5122d2 logs for non well formed XML files 2022-09-12 14:25:23 +02:00
Claudio Atzori ff6f789b6d code formatting 2022-09-09 15:16:31 +02:00
Claudio Atzori b5d6966c01 Merge branch 'beta' into clean_country 2022-09-09 12:20:19 +02:00
Claudio Atzori b5f7bd30be Merge branch 'beta' into clean_subjects 2022-09-09 12:20:04 +02:00
Alessia Bardi a539c6ccaf https for handle URLs 2022-09-09 12:16:28 +02:00
Claudio Atzori 14dc909a14 Merge branch 'beta' into clean_country 2022-09-09 10:38:17 +02:00
Alessia Bardi 9ef063d502 #7861#note-8 instance url from handle 2022-09-07 17:29:54 +03:00
Claudio Atzori adb526b0e1 Merge branch 'beta' into clean_subjects 2022-08-12 10:51:17 +02:00
Claudio Atzori cb7c07c54e [scholix] added step to create tar archive 2022-08-11 11:25:24 +02:00
Claudio Atzori 2aa16d0432 [scholix] fixed OpenCitation dump procedure 2022-08-10 17:39:29 +02:00
Miriam Baglioni 7dbdd4a0fe [Clean Country]changes related to #241 (comment) 2022-08-10 15:13:10 +02:00
Claudio Atzori 51ad93e545 [scholix] fixed OpenCitation dump procedure 2022-08-10 11:57:56 +02:00
Miriam Baglioni 62d2138806 [Clean Context] changed a bit the logic. Added the check not to have result hosted by a datasource of type institutional repository from NL. Added also the check that the country should have been included in the result via propagation for it to be removed 2022-08-08 14:10:47 +02:00
Claudio Atzori 3418ce50ac cleaning of subjects: perform the cleaning when the given value is equivalent to one of the terms in the vocabulary 2022-08-08 12:48:47 +02:00
Miriam Baglioni 390013a4b2 mergin with branch beta 2022-08-08 12:30:31 +02:00
Claudio Atzori 4eaa063b1f cleaning of subjects 2022-08-05 16:56:09 +02:00
Claudio Atzori 32cee1f619 WIP: cleaning of subjects 2022-08-05 12:32:08 +02:00
Claudio Atzori 6c0fd9284b merge from beta 2022-08-05 10:42:53 +02:00
Claudio Atzori b78889a0ce WIP: cleaning of subjects 2022-08-05 09:11:37 +02:00
Miriam Baglioni a7a18d7630 [Graph Dump] removed code for the dump from the project. Fixed issues in tests when possible 2022-08-04 17:40:40 +02:00
Claudio Atzori 27a91841e7 WIP: cleaning of subjects 2022-08-04 11:39:39 +02:00
Claudio Atzori 1dd1e4fe3a extended test for mapping project_organization relations 2022-07-28 11:27:08 +02:00
Claudio Atzori 09ccc7b472 Merge branch 'beta' into project_organization_contribution 2022-07-28 09:49:59 +02:00
Miriam Baglioni 5968ec018d [Clean Country] modified workflow and added param file 2022-07-22 16:48:38 +02:00
Miriam Baglioni a12d28c644 [Clean Country] added logic not to remove country from result if it exist a hosting datasource with that country. Moreover the country will be removed only if added with propagation 2022-07-22 16:23:12 +02:00
Miriam Baglioni 2c933f1158 mergin with branch beta 2022-07-22 14:57:41 +02:00
Sandro La Bruzzo ddc414b258 fixed wrong json param 2022-07-22 09:43:15 +02:00
Sandro La Bruzzo 5f651f2316 changed filter relation on SubRelType 2022-07-21 10:11:48 +02:00
Miriam Baglioni 65cc736e2f [Clean Country] first implementation to remove country NL from results collected from NARCIS when doi starts with mendely prefix 2022-07-20 17:05:56 +02:00
Sandro La Bruzzo 5b76321d9c implemented oozie workflow to generate scholix dump filtering relclass semantic 2022-07-20 16:34:32 +02:00
Claudio Atzori 1138b2ac8e code formatting 2022-07-19 14:15:49 +02:00
Claudio Atzori 0c1cfee396 mapping oaf:fulltext elements in the result.fulltext field 2022-07-11 17:34:59 +02:00
Claudio Atzori 0cb1c70788 code formatting 2022-07-01 10:44:08 +02:00
Claudio Atzori 4ec13e2b66 Merge branch 'master' into dump_new_funded_products 2022-07-01 10:30:28 +02:00
Claudio Atzori 7da24c1dec added more logging 2022-06-28 13:47:49 +02:00
Miriam Baglioni 71744a1f52 [DUMP DELTA PROJECTS] refactoring 2022-06-27 18:07:58 +02:00
Claudio Atzori a8773af0cb Merge branch 'beta' into project_organization_contribution 2022-06-27 09:37:40 +02:00
Claudio Atzori 5130eac247 mapping by participant project contribution 2022-06-24 17:16:42 +02:00
Miriam Baglioni f561f13dd9 [Funder Products Dump] fixed names of parameters in workflow 2022-06-21 18:18:17 +02:00
Miriam Baglioni b98f904d48 [Funder Products Dump] new way to avoid using hive 2022-06-21 17:52:27 +02:00
Miriam Baglioni 7423577a08 [Graph DUMP] add code to produce the delta of new projects with respect to the previous delta/dump 2022-06-21 14:51:38 +02:00
Claudio Atzori b295a40d9c restored use of name_particles when parsing author names 2022-06-16 12:20:43 +02:00
Claudio Atzori 4c8e820ff0 mapping relationship from trasformed records based on oaf:relation 2022-06-14 08:49:02 +02:00
Claudio Atzori 116902c028 mapping relationship from trasformed records based on oaf:relation 2022-06-13 14:31:48 +02:00
Claudio Atzori 52cb086506 [graph grouping] drop relation target path before copying from source 2022-05-16 12:08:36 +02:00
Claudio Atzori 997c50078e [graph grouping] drop relation target path before copying from source 2022-05-16 12:07:40 +02:00
Claudio Atzori 6031acb2e3 [openorgs] fixed parent/child query, using the correct semantic labels 2022-05-16 09:20:48 +02:00
Claudio Atzori 0dc33ea391 [openorgs] fixed parent/child query, using the correct semantic labels 2022-05-16 09:20:30 +02:00
Miriam Baglioni e4eac1d20b [EOSC TAG] added code to remove EOSC Jupyter Notebook from subjects and put EOSC as classid in the qualifier 2022-05-13 11:01:33 +02:00
Sandro La Bruzzo 22f65680b9 Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2022-05-11 15:30:12 +02:00
Sandro La Bruzzo ca8d26bcb4 added better filter for openCitations 2022-05-11 15:29:57 +02:00
Claudio Atzori 5d3b4a9c25 [graph merge beta] merge datasource originalid, collectedfrom, and pid lists 2022-05-11 14:13:06 +02:00
Claudio Atzori 2a8e0fb72f [openorgs] mapping parent/child relations without massaging the semantic labels 2022-05-10 08:45:53 +02:00
Claudio Atzori 77bc9863e9 [openorgs] mapping parent/child relations without massaging the semantic labels 2022-05-09 16:06:04 +02:00
Claudio Atzori 846975c886 [eosc_services] using the correct 'keyword' subject type, as declared in the dnet:subject_classification_typologies vocabulary 2022-05-05 11:37:58 +02:00
Claudio Atzori da611cfbbd [eosc_services] resolved merge conflicts 2022-05-03 13:37:15 +02:00
Claudio Atzori 2ade69dea6 EOSC Services - minor 2022-05-02 17:03:31 +02:00
Claudio Atzori b6a7ff3a99 EOSC Services - removed fields from mapping, testing preparation 2022-05-02 15:52:33 +02:00
Claudio Atzori a8c51f6f16 EOSC Services - fixed query and testing preparation 2022-05-02 11:09:03 +02:00
Claudio Atzori f5f532d134 EOSC Services - ongoing update 2022-04-29 12:25:24 +02:00
Claudio Atzori 5ffc24d1ba EOSC Services - ongoing update 2022-04-26 16:18:41 +02:00
Miriam Baglioni 19d90658fc [Clean Context] added description to parameters 2022-04-22 15:41:23 +02:00
Miriam Baglioni e0915061c2 [Clean Context] fixed issue in param name 2022-04-21 16:32:40 +02:00
Miriam Baglioni 9a961a0092 [Clean Context] fixed issue in param name 2022-04-21 15:12:24 +02:00
Miriam Baglioni 5b7d9e741c [Clean Context] added logic to cleaning workflow to accomodate also context cleaning 2022-04-21 13:02:14 +02:00
Miriam Baglioni ccba1a3db1 [Clean Context] added logic to cleaning workflow to accomodate also context cleaning 2022-04-21 13:00:06 +02:00
Miriam Baglioni a38f0f5ea7 mergin with branch beta 2022-04-20 15:44:18 +02:00
Miriam Baglioni dbfbe8841a [Clean Context] changed the description in input parameters 2022-04-20 15:41:03 +02:00
Michele Artini c96a8613f8 update SQL queries 2022-04-20 12:07:49 +02:00
Michele Artini 4314db55c8 migration to services: update sql queries 2022-04-19 15:05:02 +02:00
Claudio Atzori 05fafa1408 [graph raw] avoid NPEs importing datasource consent fields 2022-04-06 15:23:50 +02:00
Claudio Atzori 8c457f1b2c conflicts resolved, merged from beta 2022-04-06 10:27:52 +02:00
Miriam Baglioni 79336d46c5 [Clean Context] first naive implementation of a functionality to clean not wanted contextes from one result. This implementation simply verifies the main title of the results start with a given string 2022-04-04 15:52:31 +02:00
Claudio Atzori 0a0ae84c22 [graph raw] DOI based instance URLs on https 2022-03-29 10:52:58 +02:00
Claudio Atzori 741bc99c47 Merge branch 'beta' into datasource_pdf_consent 2022-03-28 09:20:48 +02:00
Miriam Baglioni 89fd275480 [HostedByMap] added left over from PR and fixed issue on workflow 2022-03-21 09:54:45 +01:00
Miriam Baglioni 0f7d8ca2e0 [HostedByMap] change on master to align to PR 201 on beta merged as 9f3036c847 2022-03-11 15:16:02 +01:00
Claudio Atzori f25407bbe2 added mapping for datasource consent fields to integrate them in the graph 2022-03-11 09:32:42 +01:00
Miriam Baglioni 2c5087d55a [HostedByMap] download of doaj from json, modification of test resources, deletion of class no more needed for the CSV download 2022-03-04 15:18:21 +01:00
Miriam Baglioni 5d608d6291 [HostedByMap] changed the model to include also oaStart date and review process that could be possibly used in the future 2022-03-04 11:06:09 +01:00
Miriam Baglioni 8a41f63348 [HostedByMap] update to download the json instead of the csv 2022-03-04 10:38:43 +01:00
Miriam Baglioni 44b0c03080 [HostedByMap] update to download the json instead of the csv 2022-03-04 10:37:59 +01:00
Claudio Atzori 99f5b14469 [graph raw] invisible records stored among the raw graph rather than the claimed subgraph 2022-02-18 15:20:57 +01:00
Claudio Atzori cf8443780e added processingchargeamount to the result view 2022-02-18 15:17:48 +01:00
Alessia Bardi 600ede1798 serialisation of APCs int he XML records 2022-02-11 11:00:20 +01:00
Miriam Baglioni aae667e6b6 [APC at the result level] added the APC at the level of the result and modified test class 2022-02-04 12:34:25 +01:00
Claudio Atzori 1322379741 Merge branch 'beta' into delegated_authorities 2022-01-25 14:28:25 +01:00
Claudio Atzori 97ad94d7d9 [graph resolution] drop output path at the beginning 2022-01-24 18:02:07 +01:00
Claudio Atzori dd52bf1bb8 copy relations to the graphOutputPath 2022-01-21 13:59:29 +01:00
Claudio Atzori 4983d6536d Merge branch 'beta' into delegated_authorities 2022-01-21 13:02:48 +01:00
Claudio Atzori f0ea2410e5 improved mapping titles from datacite records to consider title types 2022-01-21 10:50:34 +01:00
Claudio Atzori 391aa1373b added unit test 2022-01-19 17:13:21 +01:00
Claudio Atzori 44a937f4ed factored out entity grouping implementation, extended to consider results from delegated authorities rather than identical records from other sources 2022-01-19 12:24:52 +01:00
Miriam Baglioni a7c4d0d16d [DoiBoost Organizations] added parameter to specify the action in the wf raw_organizations to be able to load the openorgs organization as in the loading step for the construction of the graph 2022-01-13 13:52:00 +01:00
Sandro La Bruzzo 57e2c4b749 formatted code 2022-01-12 09:40:28 +01:00
Claudio Atzori 4f212652ca scalafmt: code formatting 2022-01-11 16:57:48 +01:00
Claudio Atzori 908294d86e OAF-store-graph mdstores: firther fix for PR#180 2022-01-05 15:49:05 +01:00
Claudio Atzori a6977197b3 serialise records in the OAF-store-graph mdstores in json format. Read them again in the graph construction phase using a tolerant parser to support backward compatible changes in the evolution of the schema 2022-01-03 17:25:26 +01:00
Miriam Baglioni 31b26d48ac [Graph Dump] fixed issue on extraction of relation between entities and contexts: the relationship name and type were swapped 2021-12-23 10:09:47 +01:00
Miriam Baglioni be0acccf42 Merge branch 'beta' into dump 2021-12-22 12:39:57 +01:00
Miriam Baglioni 460e6b95d6 [Graph Dump] - 2021-12-21 13:48:03 +01:00
Sandro La Bruzzo 3920d68992 Fixed workflow generation of delta in datacite 2021-12-21 11:41:49 +01:00
Sandro La Bruzzo b881ee5ef8 [scholexplorer]
- implemented generation of scholix of delta update of datacite
2021-12-15 11:25:32 +01:00
Sandro La Bruzzo 63952018c0 [scholexplorer]
-moved SparkRetrieveDataciteDelta in scala folder
2021-12-15 11:25:32 +01:00
Sandro La Bruzzo e5bff64f2e [scholexplorer]
- Minor fix on SparkConvertRDDtoDataset
-first implementation of retrieve datacite dump
2021-12-15 11:25:32 +01:00
Miriam Baglioni 56409d1281 [Dump] resolved conflicts with beta and merging 2021-12-14 15:03:45 +01:00
Miriam Baglioni 8d755cca80 - 2021-12-13 15:01:40 +01:00
Claudio Atzori 41c70c607d cleaning workflow assigns the proper default instance type when a value could not be cleaned using the vocabularies 2021-12-09 16:44:28 +01:00
Sandro La Bruzzo bf880e2508 [scala-refactor] Module dhp-graph-mapper:
Moved all scala source into src/main/scala and src/test/scala
2021-12-06 13:57:41 +01:00
Miriam Baglioni 58bc3f223a [GRAPH DUMP] Add filtering for relation we do not want to dump. It is based on the relclass 2021-12-02 14:09:46 +01:00
Miriam Baglioni 8905a39bf3 mergin with branch beta 2021-12-02 13:17:29 +01:00
Claudio Atzori 3b19821f3c added stats computation on the graph hive DB tables 2021-12-02 10:44:10 +01:00
Claudio Atzori cfa4560769 minor: fixed hive action name 2021-12-02 10:43:36 +01:00
Claudio Atzori d85af6fc25 [cleaning wf] fixed OAF record navigation, a mapping defined on a container object would have prevented the natvigation to continue on its properties 2021-12-01 15:49:15 +01:00
Claudio Atzori 014e872ae1 [resolution wf] added optional parameter to skip the entity resolution 2021-11-26 15:38:56 +01:00
Sandro La Bruzzo 483d3039d1 entity resolution: added distcpt of missing entities in graph materialization 2021-11-22 15:55:24 +01:00
Sandro La Bruzzo 93fe8ce8b2 entity resolution: fix test 2021-11-22 15:50:43 +01:00
Sandro La Bruzzo 35e20b0647 updated resolution wf:
- generate a new version of the graph
 - changed merge from union to join
2021-11-22 11:48:55 +01:00
Miriam Baglioni 0506fa2654 [Graph Dump] changed to mirror the changes in the model 2021-11-19 15:56:25 +01:00
Miriam Baglioni 9fae872181 [Graph Dump] changed to mirror the changes in the model 2021-11-19 11:25:50 +01:00
Claudio Atzori bb5dca7979 cleanup 2021-11-18 17:10:46 +01:00
Miriam Baglioni 793b5a8e5f Aggiornare 'dhp-workflows/dhp-graph-mapper/src/main/java/eu/dnetlib/dhp/oa/graph/dump/ResultMapper.java'
Removing the dump of Measure at the level of the result. We decided not to map it
2021-11-18 14:49:38 +01:00
Miriam Baglioni c6a9f0a1a8 mergin with branch beta 2021-11-16 12:04:40 +01:00
Miriam Baglioni 99d86134f5 [Graph Dump] changed the dump since the measures have been moded at the level of the instance 2021-11-16 12:04:21 +01:00
Claudio Atzori 668ac25224 [graph resolution] using existing argument parser file name 2021-11-15 17:02:45 +01:00
Claudio Atzori 7d0a03f607 [graph resolution] minor 2021-11-15 14:45:54 +01:00
Claudio Atzori 7c804acda8 [graph resolution] minor 2021-11-15 14:42:43 +01:00
Claudio Atzori d2c787d416 [graph resolution] fixed sequence of the workflow steps 2021-11-15 14:31:15 +01:00
Miriam Baglioni 6595135a1a [Dump Schemas] changed the schema of the dumped result according to the modifications in the bestAccessRight type 2021-11-12 11:45:38 +01:00
Miriam Baglioni 43cae4ad88 Merge branch 'dump' of https://code-repo.d4science.org/D-Net/dnet-hadoop into dump 2021-11-12 11:36:54 +01:00
Miriam Baglioni b3f9370125 merge with beta - resolved conflict in pom 2021-11-12 11:25:26 +01:00
Sandro La Bruzzo a7763d2492 removed alternate identifier in resolutionMap 2021-11-12 09:56:45 +01:00
Sandro La Bruzzo 2ca0a436ad added SparkResolveEntities node to the oozie wf 2021-11-11 10:25:42 +01:00
Sandro La Bruzzo 9cb195314f implemented and tested resolution of entities 2021-11-11 10:17:40 +01:00
Miriam Baglioni 8cc50ecee0 [Graph Dump] changed AccessRight with BestAccessRight in the dump and modified the dependency to the schema to the SNAPSHOT 2021-11-11 08:59:20 +01:00
Miriam Baglioni 88b73f4f49 mergin with branch beta 2021-11-10 17:00:52 +01:00
Sandro La Bruzzo 6477a40670 implement filter of openCitation 2021-11-09 11:27:12 +01:00
Miriam Baglioni 94918a673c [Graph DUMP] Fix issue for empty origilaId list 2021-11-08 10:25:28 +01:00
Miriam Baglioni 8442efd8d1 [Graph DUMP] Filtering out from the originalIds the id of the result in OpenAIRE 2021-11-05 12:29:22 +01:00
Claudio Atzori 5681e89544 Update 'dhp-workflows/dhp-graph-mapper/src/main/resources/eu/dnetlib/dhp/oa/graph/dump/schemas/result_schema.json' 2021-11-05 12:18:24 +01:00
Miriam Baglioni a22c29fba1 [Graph DUMP] Filtering out from the originalIds the id of the result in OpenAIRE 2021-11-05 12:08:33 +01:00
Miriam Baglioni c10ff6928c [Graph DUMP] add schema of the dump related to the model as in dhp-schemas.2.8.31. Note the measere element at the level of the result has been removed because of issues on where to display it: at the level of the result or at the level of the entity 2021-11-05 11:36:21 +01:00
Miriam Baglioni 0857849a86 [Graph DUMP] Remove dump of measure until it will be clear where to put it (at the level of result or at the level of the instance) 2021-11-05 11:02:37 +01:00
Sandro La Bruzzo 7bd224f051 implement first version of scholexplorer integration for the generation of final graph 2021-11-02 15:58:15 +01:00
Claudio Atzori 1225ba0b92 [resolution] increasing number of partitions to avoid OOM 2021-10-28 16:18:17 +02:00
Sandro La Bruzzo d9cbca83f7 moved filter on next phase 2021-10-28 16:13:24 +02:00
Sandro La Bruzzo 1be9aa0a5f Removed filter of datacite items from the raw graph merging phase, Datacite is not an actionset anymore in beta 2021-10-26 17:52:20 +02:00
Sandro La Bruzzo 4acfa8fa2e Scholexplorer Datasource Aggregation:
- Added collectedfrom in the inverse relation generated
Relation resolution:
- increased number of partitions in workflow.xml
- using classid instead of classname to build the pid-dnetId mapping
2021-10-26 17:51:20 +02:00
Sandro La Bruzzo 034304b33a conflict resolved on merge 2021-10-26 09:40:47 +02:00
Claudio Atzori d147295c2f avoiding java.io.NotSerializableException: java.util.HashMap 2021-10-21 14:15:57 +02:00