Giambattista Bloisi
6cc7d8ca7b
GroupEntities and DispatchEntities are now merged in GroupEntitiesSparkJob
2023-08-30 10:43:31 +02:00
Claudio Atzori
bf35280ea6
code formatting
2023-08-29 11:11:00 +02:00
Giambattista Bloisi
95cd2b9b1e
Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob
...
Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows
2023-08-10 11:53:48 +02:00
Giambattista Bloisi
fab9920271
DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag
2023-08-09 15:41:43 +02:00
Miriam Baglioni
c25ac21e5e
Merge pull request 'graph cleaning, suggestions from ticket 8898' (#325) from cleaning_8898 into beta
...
Reviewed-on: D-Net/dnet-hadoop#325
2023-08-08 11:14:19 +02:00
Claudio Atzori
11ffb9bd68
rule out records with NULL dataInfo
2023-07-31 12:35:33 +02:00
Claudio Atzori
270df939c4
partial implementation of the suggestions from https://support.openaire.eu/issues/8898
2023-07-25 17:29:50 +02:00
Giambattista Bloisi
e64c2854a3
Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
...
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Giambattista Bloisi
bb5b845e3c
Use scala.binary.version property to resolve scala maven dependencies
...
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
Claudio Atzori
b76a47b103
[aggregator graph] added column alias when mapping organization PIDs from the OpenOrgs database
2023-06-13 11:38:10 +02:00
Claudio Atzori
ad04f14b81
Merge branch 'beta' into distinct_pids_from_openorgs_beta
2023-06-12 09:58:21 +02:00
Claudio Atzori
e1409ffe80
update sql query to return distinct pids
2023-06-12 09:47:45 +02:00
Claudio Atzori
e45777e7e1
[aggregator graph] added validation for URLs mapped from oaf:fulltext
2023-05-26 11:33:42 +02:00
Claudio Atzori
8acad52a0c
Merge branch 'beta' into apc_affiliation
2023-05-15 15:47:33 +02:00
Claudio Atzori
8a463cc3e8
fixed organization id created when mapping APC affiliations. Factored out ROR constants in dhp-common
2023-05-15 15:44:46 +02:00
Miriam Baglioni
99ac5bab46
added check to avoid NPE when checking the organization country
2023-05-04 19:38:39 +02:00
Claudio Atzori
d8882c4481
extended mapping applied to datacite records to produce affiliations using the ROR ids. In case of APCs it includes the amount and the currency in the relation
2023-05-02 11:56:51 +02:00
Claudio Atzori
851f664bd9
Merge branch 'beta' into graph_cleaning_refactoring
2023-05-02 09:55:40 +02:00
Claudio Atzori
a2dcb06daf
added eoscifguidelines in the result view; removed compute statistics statements
2023-04-11 10:43:32 +02:00
Claudio Atzori
864f4051d3
[graph cleaning] added missing case
2023-04-05 11:35:47 +02:00
Claudio Atzori
dead87917f
[graph cleaning] cleanup
2023-04-04 13:13:43 +02:00
Claudio Atzori
2a6ba29b64
[graph cleaning] unit tests & cleanup
2023-04-04 12:34:51 +02:00
Claudio Atzori
b502f86523
fixed input path supplied to GetDatasourceFromCountry; adjusted the various spark.sql.shuffle.partitions
2023-03-24 13:09:12 +01:00
Claudio Atzori
c07857fa37
[graph cleaning] unit tests & cleanup
2023-03-23 15:57:47 +01:00
Claudio Atzori
90e61a8aba
[graph cleaning] WIP: refactoring of the cleaning stages, unit tests
2023-03-23 15:03:26 +01:00
Claudio Atzori
488d9a5eaa
[graph cleaning] WIP: refactoring of the cleaning stages, unit tests
2023-03-23 10:41:13 +01:00
Claudio Atzori
4f5ba0ed52
[graph cleaning] WIP: refactoring of the cleaning stages, unit tests
2023-03-21 14:41:20 +01:00
Claudio Atzori
6d3d18d8b5
[graph cleaning] WIP: refactoring of the cleaning stages
2023-03-16 17:23:36 +01:00
Claudio Atzori
518618f1a9
[graph cleaning] avoid overwriting the subject class with 'keyword' for those with provenance 'subject:fos'
2023-03-14 15:22:47 +01:00
Claudio Atzori
e28d395e87
[aggregator graph] using dedicated path to sync claims, adjusted paths with wildcards
2023-03-08 21:16:52 +01:00
Claudio Atzori
5b8fd37314
[aggregator graph] using dedicated path to sync claims
2023-03-08 15:28:14 +01:00
Claudio Atzori
7fd89566c2
[aggregator graph] handle paths including wildcards
2023-03-08 12:43:00 +01:00
Claudio Atzori
8ec0d62d91
pre-group the records in each table before joining the contents from BETA and PROD together
2023-03-02 14:49:19 +01:00
Claudio Atzori
6f488547a7
ignore non processable records
2023-03-01 14:49:51 +01:00
Claudio Atzori
7d263f265e
adjusted logs
2023-03-01 11:58:07 +01:00
Claudio Atzori
9c59dac859
followup changes reorganising the mdstore synchronisation mechanism
2023-03-01 10:16:20 +01:00
Sandro La Bruzzo
78e51c182a
Added missing parameter to raw all workflow
2023-02-28 10:16:01 +01:00
Michele Artini
fddcf701e9
updated the order of the compatibilities
2023-02-22 12:07:09 +01:00
Sandro La Bruzzo
8920932dd8
Code formatted
2023-02-08 11:34:18 +01:00
Sandro La Bruzzo
6c81a161d2
Merge remote-tracking branch 'origin/beta' into 8231-mdstore-synch-improve
2023-02-08 10:29:09 +01:00
Miriam Baglioni
d6895f0387
Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta
2023-01-09 17:28:38 +01:00
Sandro La Bruzzo
3c9826f186
updated lines function to its implementation linesWithSeparators.map(l => l.stripLineEnd); in this way we force the scala compiler plugin to treat this pipeline as scala code and not as a java.string.lines() pipeline
2022-12-21 11:21:17 +01:00
Miriam Baglioni
8685eaa706
[Clean Country] added test to verify removal of country
2022-12-16 15:31:25 +01:00
Miriam Baglioni
dc0ec88a58
Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta
2022-12-16 13:18:32 +01:00
Miriam Baglioni
d791840b82
[Clean Country] added test to verify removal of country
2022-12-16 13:18:29 +01:00
Claudio Atzori
7b80b24f82
[cleaning] country cleaning must use both PID and AlternateIdentifier fields
2022-12-15 14:49:04 +01:00
Claudio Atzori
b8bafab8a0
[cleaning] improved vocabulary based mapping, specialization for the strict vocab cleaning
2022-12-12 14:43:03 +01:00
Sandro La Bruzzo
5e4866d033
implemented synch for single mdstore
2022-12-12 11:29:46 +01:00
Claudio Atzori
c18b8048c3
[cleaning] avoid NPE
2022-12-10 11:41:38 +01:00
Claudio Atzori
8b44afe5e5
[cleaning] avoid NPE
2022-12-09 15:44:57 +01:00
Claudio Atzori
389dd25430
[cleaning] avoid NPE
2022-12-08 18:40:48 +01:00
Claudio Atzori
730228d73d
[cleaning] align wf parameter names in test
2022-12-08 18:40:22 +01:00
Claudio Atzori
2094fa6db0
[cleaning] align wf parameter names
2022-12-08 17:22:26 +01:00
Miriam Baglioni
a485a94956
[Cleaning] fixed parameter name in property file
2022-12-08 16:59:34 +01:00
Miriam Baglioni
3d99b78d94
[Cleaning] fixed error in parameter (workingPath to workingDir)
2022-12-08 10:25:02 +01:00
Sandro La Bruzzo
5a48a2fb18
implemented synch for single mdstore
2022-12-01 11:34:43 +01:00
Claudio Atzori
8e3edba318
[graph cleaning] testing the collectedfrom and hostedby patch procedure
2022-11-29 16:07:09 +01:00
Claudio Atzori
58c05731f9
[graph cleaning] WIP: testing the collectedfrom and hostedby patch procedure
2022-11-29 11:21:51 +01:00
Claudio Atzori
11695ba649
[graph cleaning] patch also the result's collectedfrom and hostedby datasource name according to the datasource master-duplicate mapping
2022-11-28 10:18:43 +01:00
Claudio Atzori
24ef301cc1
[graph cleaning] patch the result's collectedfrom and hostedby identifiers according to the datasource master-duplicate mapping
2022-11-28 09:54:18 +01:00
Alessia Bardi
3c08269a4d
Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta
2022-11-22 17:31:00 +01:00
Alessia Bardi
2687fc9f73
tests for EOSC Future review - ROhub
2022-11-22 17:30:56 +01:00
Claudio Atzori
7c3390ac10
Merge branch 'beta' into eoscifguidelines-from-mdstores
2022-11-07 12:18:40 +01:00
Sandro La Bruzzo
2b9a20a4a3
Changed the way Scholexplorer filters the relationships: filtering out all relations coming from OpenCitation is wrong, because we lose many relations that intersect OpenCitation but do not come only from there
2022-10-24 12:53:47 +02:00
Alessia Bardi
208ed32315
fixed xpath for semantic relation
2022-10-23 18:18:13 +02:00
Alessia Bardi
ee759ac92d
file format after mvn compile
2022-10-23 18:09:47 +02:00
Alessia Bardi
31a10f000b
Map the field oaf:eoscifguidelines from mdstores. Currently we can find it in ROHub metadata
2022-10-23 18:05:37 +02:00
Claudio Atzori
ae7cd0735a
[graph2hive] more partitions
2022-10-14 15:47:58 +02:00
Claudio Atzori
b47aaf4dd1
[cleaning] subjects declared as belonging to specific vocabularies whose values are not found in the vocab are set to type keyword
2022-10-13 11:23:43 +02:00
Claudio Atzori
6163ecbf63
[cleaning] renamed parameters in wf action
2022-10-11 11:20:03 +02:00
Claudio Atzori
b301e9fdff
[cleaning] renamed action name/description
2022-10-11 11:08:52 +02:00
Claudio Atzori
ece40adc09
[cleaning] fixing NPE in the country cleaning phase
2022-10-11 10:10:20 +02:00
Claudio Atzori
8d97949316
[cleaning] fixed loop in wf nodes
2022-10-07 09:52:45 +02:00
Alessia Bardi
49360770d7
map w3id as instance url
2022-09-28 14:16:39 +02:00
Miriam Baglioni
b5b5a4c192
[CleanCountry] fixed issue
2022-09-28 12:42:51 +02:00
Claudio Atzori
3f90d159e3
code formatting
2022-09-27 15:08:00 +02:00
Claudio Atzori
0b3e44e521
Merge branch 'beta' into relation-from-odf
2022-09-27 14:57:01 +02:00
Claudio Atzori
57dbeb08d2
code formatting
2022-09-27 14:55:10 +02:00
Claudio Atzori
25e9d92aad
Merge branch 'beta' into clean_country
2022-09-27 14:27:49 +02:00
Alessia Bardi
fd63e9bfac
Mapping all relationships supported in ModelConstants and ModelSupport
2022-09-26 11:24:13 +02:00
Alessia Bardi
c5eb722170
relationships from relatedIdentifier whose target id type is one of the pid type with an authority
2022-09-23 15:47:05 +02:00
Claudio Atzori
c86cc53520
suppressing hyper verbose spark logs during unit test execution
2022-09-23 15:20:40 +02:00
Alessia Bardi
ba33ff71fd
refactoring for the generation of relationships from related identifier of type 'OPENAIRE'
2022-09-23 15:17:13 +02:00
Alessia Bardi
982bcc1e35
test wrid pid and record identifier
2022-09-23 12:06:06 +02:00
Claudio Atzori
c42850328e
fixed semantic (subreltype) for ServiceOrganization relations
2022-09-22 16:23:25 +02:00
Claudio Atzori
e45ec15221
Merge branch 'beta' into clean_country
2022-09-19 11:34:02 +02:00
Claudio Atzori
26e1badded
added instance.url syntactical validation, avoid creating multiple duplicated URLs
2022-09-19 11:19:10 +02:00
Claudio Atzori
192215a18e
merged from branch discard-non-wellformed
2022-09-19 10:17:10 +02:00
Claudio Atzori
e370e940d8
[aggregator graph] save invalid records aside for further inspection
2022-09-16 14:06:28 +02:00
Claudio Atzori
1e42d984e1
[aggregator graph] save invalid records aside for further inspection
2022-09-15 10:49:42 +02:00
Alessia Bardi
9e7ec4198f
fixed test
2022-09-14 18:08:56 +02:00
Claudio Atzori
c48f6e9c57
[aggregator graph] save invalid records aside for further inspection
2022-09-14 17:11:26 +02:00
Claudio Atzori
a0919ed495
[aggregator graph] save invalid records aside for further inspection
2022-09-14 13:27:39 +02:00
Alessia Bardi
b99a011345
return empty Oaf list if record cannot be parsed
2022-09-13 11:51:55 +02:00
Alessia Bardi
27af5122d2
logs for non well formed XML files
2022-09-12 14:25:23 +02:00
Claudio Atzori
ff6f789b6d
code formatting
2022-09-09 15:16:31 +02:00
Claudio Atzori
b5d6966c01
Merge branch 'beta' into clean_country
2022-09-09 12:20:19 +02:00
Claudio Atzori
b5f7bd30be
Merge branch 'beta' into clean_subjects
2022-09-09 12:20:04 +02:00
Alessia Bardi
a539c6ccaf
https for handle URLs
2022-09-09 12:16:28 +02:00
Claudio Atzori
1203378441
Merge branch 'beta' into clean_subjects
2022-09-09 10:38:47 +02:00
Claudio Atzori
14dc909a14
Merge branch 'beta' into clean_country
2022-09-09 10:38:17 +02:00
Alessia Bardi
9ef063d502
#7861#note-8 instance url from handle
2022-09-07 17:29:54 +03:00
Alessia Bardi
5c45d52af3
testing for RiuNet
2022-09-07 15:40:57 +03:00
Alessia Bardi
a11eb38065
testing for RO-Hub
2022-09-02 16:07:36 +02:00
Claudio Atzori
b7c387c21f
cleaning of subjects: avoid duplicated subjects, prioritise collected vs inferred or other sources
2022-08-12 15:09:16 +02:00
Claudio Atzori
adb526b0e1
Merge branch 'beta' into clean_subjects
2022-08-12 10:51:17 +02:00
Claudio Atzori
cb7c07c54e
[scholix] added step to create tar archive
2022-08-11 11:25:24 +02:00
Claudio Atzori
2aa16d0432
[scholix] fixed OpenCitation dump procedure
2022-08-10 17:39:29 +02:00
Miriam Baglioni
7dbdd4a0fe
[Clean Country] changes related to D-Net/dnet-hadoop#241 (comment)
2022-08-10 15:13:10 +02:00
Claudio Atzori
51ad93e545
[scholix] fixed OpenCitation dump procedure
2022-08-10 11:57:56 +02:00
Miriam Baglioni
62d2138806
[Clean Context] slightly changed the logic. Added a check that the result is not hosted by a datasource of type institutional repository from NL. Also added a check that the country was added to the result via propagation before it is removed
2022-08-08 14:10:47 +02:00
Claudio Atzori
3418ce50ac
cleaning of subjects: perform the cleaning when the given value is equivalent to one of the terms in the vocabulary
2022-08-08 12:48:47 +02:00
Miriam Baglioni
390013a4b2
merging with branch beta
2022-08-08 12:30:31 +02:00
Claudio Atzori
4eaa063b1f
cleaning of subjects
2022-08-05 16:56:09 +02:00
Claudio Atzori
32cee1f619
WIP: cleaning of subjects
2022-08-05 12:32:08 +02:00
Claudio Atzori
6c0fd9284b
merge from beta
2022-08-05 10:42:53 +02:00
Claudio Atzori
b78889a0ce
WIP: cleaning of subjects
2022-08-05 09:11:37 +02:00
Miriam Baglioni
a7a18d7630
[Graph Dump] removed code for the dump from the project. Fixed issues in tests when possible
2022-08-04 17:40:40 +02:00
Claudio Atzori
27a91841e7
WIP: cleaning of subjects
2022-08-04 11:39:39 +02:00
Claudio Atzori
e62018e95d
[aggregator graph] added more assertions in test
2022-08-03 12:26:05 +02:00
Claudio Atzori
f62c4e05cd
code formatting
2022-07-29 11:56:01 +02:00
Claudio Atzori
1dd1e4fe3a
extended test for mapping project_organization relations
2022-07-28 11:27:08 +02:00
Claudio Atzori
09ccc7b472
Merge branch 'beta' into project_organization_contribution
2022-07-28 09:49:59 +02:00
Miriam Baglioni
5968ec018d
[Clean Country] modified workflow and added param file
2022-07-22 16:48:38 +02:00
Miriam Baglioni
a12d28c644
[Clean Country] added logic not to remove the country from a result if a hosting datasource with that country exists. Moreover, the country is removed only if it was added via propagation
2022-07-22 16:23:12 +02:00
Miriam Baglioni
2c933f1158
merging with branch beta
2022-07-22 14:57:41 +02:00
Sandro La Bruzzo
ddc414b258
fixed wrong json param
2022-07-22 09:43:15 +02:00
Sandro La Bruzzo
5f651f2316
changed filter relation on SubRelType
2022-07-21 10:11:48 +02:00
Miriam Baglioni
65cc736e2f
[Clean Country] first implementation to remove country NL from results collected from NARCIS when the DOI starts with the Mendeley prefix
2022-07-20 17:05:56 +02:00
Sandro La Bruzzo
5b76321d9c
implemented oozie workflow to generate scholix dump filtering relclass semantic
2022-07-20 16:34:32 +02:00
Claudio Atzori
1138b2ac8e
code formatting
2022-07-19 14:15:49 +02:00
Claudio Atzori
0c1cfee396
mapping oaf:fulltext elements in the result.fulltext field
2022-07-11 17:34:59 +02:00
Claudio Atzori
0cb1c70788
code formatting
2022-07-01 10:44:08 +02:00
Claudio Atzori
4ec13e2b66
Merge branch 'master' into dump_new_funded_products
2022-07-01 10:30:28 +02:00
Claudio Atzori
7da24c1dec
added more logging
2022-06-28 13:47:49 +02:00
Miriam Baglioni
71744a1f52
[DUMP DELTA PROJECTS] refactoring
2022-06-27 18:07:58 +02:00
Miriam Baglioni
1d1fe3b151
[DUMP DELTA PROJECTS] refactoring
2022-06-27 18:04:59 +02:00
Claudio Atzori
a8773af0cb
Merge branch 'beta' into project_organization_contribution
2022-06-27 09:37:40 +02:00
Claudio Atzori
5130eac247
mapping by participant project contribution
2022-06-24 17:16:42 +02:00
Miriam Baglioni
edddfc6c63
[DUMP DELTA PROJECTS] adding test and resource
2022-06-21 18:28:53 +02:00
Miriam Baglioni
f561f13dd9
[Funder Products Dump] fixed names of parameters in workflow
2022-06-21 18:18:17 +02:00
Miriam Baglioni
ff74e73369
[DUMP NEW FUNDED PRODUCTS] change in resources
2022-06-21 18:02:51 +02:00
Miriam Baglioni
b98f904d48
[Funder Products Dump] new way to avoid using hive
2022-06-21 17:52:27 +02:00
Miriam Baglioni
7423577a08
[Graph DUMP] add code to produce the delta of new projects with respect to the previous delta/dump
2022-06-21 14:51:38 +02:00
Claudio Atzori
b295a40d9c
restored use of name_particles when parsing author names
2022-06-16 12:20:43 +02:00
Claudio Atzori
4c8e820ff0
mapping relationship from transformed records based on oaf:relation
2022-06-14 08:49:02 +02:00
Claudio Atzori
116902c028
mapping relationship from transformed records based on oaf:relation
2022-06-13 14:31:48 +02:00
Alessia Bardi
68bd58d6a4
tests for ROHub
2022-06-10 17:29:11 +02:00
Claudio Atzori
52cb086506
[graph grouping] drop relation target path before copying from source
2022-05-16 12:08:36 +02:00
Claudio Atzori
997c50078e
[graph grouping] drop relation target path before copying from source
2022-05-16 12:07:40 +02:00