Giambattista Bloisi
43b454399f
- Bug fix in the matchOrderedTokenAndAbbreviations algorithm, where tokens with the same initial character were always considered equal
...
- AuthorsMatch exploits the new matching strategy used for ORCID enhancements in #PR398: split author names into tokens, order the tokens, then check for matches of ordered full tokens or abbreviations
2024-04-15 18:19:29 +02:00
Claudio Atzori
ef52128c55
included new stats* workflows in parent pom list of modules, code formatting
2024-03-26 10:42:10 +01:00
Claudio Atzori
bfba71a95c
further follow up changes from integrating the mergeutils branch
2024-03-26 09:01:18 +01:00
Claudio Atzori
538b180fe0
Merge branch 'beta' into oaf_country_beta
2024-03-25 16:13:20 +01:00
Giambattista Bloisi
3f22c101d9
Merge pull request 'Enrich authors with ORCID info using new matching algorithm' ( #398 ) from new_orcid_enhancement into beta
...
Reviewed-on: D-Net/dnet-hadoop#398
2024-03-22 17:29:20 +01:00
Giambattista Bloisi
0ff7faad72
Fix conditions that prevented ORCID Enrichment
2024-03-22 16:24:49 +01:00
Giambattista Bloisi
664a381d31
Unify merge logic of entities in MergeUtils.class
2024-03-18 16:04:49 +01:00
Michele Artini
30167aa882
mapped oaf:country from results
2024-03-15 11:24:16 +01:00
Giambattista Bloisi
9092075760
Enrich authors with ORCID info using new matching algorithm
2024-03-11 13:23:59 +01:00
Sandro La Bruzzo
7d806a434c
formatted code
2024-02-28 09:31:58 +01:00
Michele Artini
3268570b2c
mapping of project PIDs
2024-02-22 14:47:21 +01:00
Claudio Atzori
a63b091bae
Merge branch 'beta' into import_orps_fix
2024-02-15 15:01:56 +01:00
Claudio Atzori
d85d2df6ad
[graph raw] fixed mapping of the original resource type from the Datacite format
2024-02-09 10:20:20 +01:00
Claudio Atzori
38c9001147
fixed import of ORPs stored on HDFS in the internal graph format (e.g. Datacite)
2024-02-07 17:02:05 +01:00
Claudio Atzori
42f5506306
[orcid enrichment] fixed directory cleanup before distcp
2024-02-05 09:45:36 +02:00
Alessia Bardi
f2a08d8cc2
test for Italian records from IRS repositories
2024-01-30 19:20:14 +01:00
Claudio Atzori
2655eea5bc
[orcid enrichment] drop paths before copying the unmodified contents
2024-01-19 16:28:05 +01:00
Claudio Atzori
cb9e739484
Merge branch 'beta' into resource_types
2024-01-11 16:29:41 +01:00
Claudio Atzori
2753044d13
refined mapping for the extraction of the original resource type
2024-01-11 16:28:26 +01:00
Miriam Baglioni
e711a05229
fixed conflicts
2024-01-10 11:03:42 +01:00
Claudio Atzori
62104790ae
added metaresourcetype to the result hive DB view
2023-12-21 12:27:10 +01:00
Miriam Baglioni
4740c808f7
-
2023-12-20 14:26:54 +01:00
Claudio Atzori
cb71a7936b
[graph cleaning] avoid stack overflow error when navigating Oaf objects declaring an Enum
2023-12-07 23:09:54 +01:00
Claudio Atzori
259c69e446
[orcid enrichment] fixed workflow definition
2023-12-06 19:41:53 +01:00
Claudio Atzori
2a233a89aa
[graph grouping] added isLookupUrl to the workflow definition, passed to the grouping spark action
2023-12-03 13:32:52 +01:00
Claudio Atzori
622fafbd2e
Merge branch 'beta' into orcid_import
2023-12-01 12:28:14 +01:00
Sandro La Bruzzo
bf0fd27c36
Removed unused function
...
Applied PR Comment of Giambattista in the PR
2023-12-01 12:16:42 +01:00
Sandro La Bruzzo
cdfb7588dd
code formatting
2023-11-30 15:31:42 +01:00
Sandro La Bruzzo
5e22b67b8a
Merge remote-tracking branch 'origin/beta' into orcid_import
2023-11-30 15:27:46 +01:00
Sandro La Bruzzo
f718caaac9
Added copy of the untouched entities of the graph
2023-11-30 14:51:00 +01:00
Sandro La Bruzzo
7b5e04f37e
removed Orcid intersection on DOIBoost
2023-11-30 14:36:50 +01:00
Claudio Atzori
4e1aac2e2f
resolved conflict in pom.xml before applying the changes from [COAR based resource types & Irish tender] #350
2023-11-29 14:37:52 +01:00
Sandro La Bruzzo
279100fa52
added test
2023-11-29 11:17:58 +01:00
Sandro La Bruzzo
59111713fa
added comment
2023-11-28 09:00:48 +01:00
Sandro La Bruzzo
6f4d0c05ea
Implemented Author Merger for ORCID that takes into account the case when name and surname are swapped
2023-11-28 08:43:56 +01:00
Sandro La Bruzzo
34a4b3cbdf
Implemented ORCID Enrichment
2023-11-24 12:39:58 +01:00
Claudio Atzori
2c77638bf5
Merge branch 'beta' into cleaning_8898
2023-11-22 14:00:10 +01:00
Claudio Atzori
11a1207f9c
[graph cleaning] applying coar based vocabularies in bulk
2023-11-22 12:22:14 +01:00
Claudio Atzori
262d7c581b
[graph cleaning] implemented further suggestions from https://support.openaire.eu/issues/8898
2023-10-31 14:34:10 +01:00
Claudio Atzori
b3a61ea955
Merge branch 'beta' into url_validation
2023-10-25 14:22:56 +02:00
Claudio Atzori
7fc621cdec
added defaults to the graph resolution workflow config-default.xml
2023-10-20 22:28:12 +02:00
Claudio Atzori
2b9d0416ec
[graph raw] URL Validator to accept double slashes
2023-10-19 16:26:37 +02:00
Claudio Atzori
6dfcd0c9a2
[raw graph] mapping original resource types
2023-10-16 12:57:18 +02:00
Claudio Atzori
54fbf09ac6
[raw graph] WIP: mapping original resource types
2023-10-16 08:57:47 +02:00
Claudio Atzori
554551682d
[raw graph] adopting the new COAR based vocabularies for the resource typing
2023-10-11 16:09:19 +02:00
Claudio Atzori
eed9fe0902
code formatting
2023-10-06 12:31:17 +02:00
Claudio Atzori
dc86018a5f
Merge branch 'merge_entities_job' into beta
2023-10-02 11:24:48 +02:00
Alessia Bardi
0935d7757c
Use v5 of the UNIBI Gold ISSN list in test
2023-09-20 15:41:35 +02:00
Alessia Bardi
cc7204a089
tests for d4science catalog
2023-09-20 15:38:32 +02:00
Claudio Atzori
5b06c9d06f
[graph raw] datainfo.invisible set as true only for entities
2023-09-04 15:15:24 +02:00
Giambattista Bloisi
6cc7d8ca7b
GroupEntities and DispatchEntities are now merged in GroupEntitiesSparkJob
2023-08-30 10:43:31 +02:00
Claudio Atzori
bf35280ea6
code formatting
2023-08-29 11:11:00 +02:00
Giambattista Bloisi
95cd2b9b1e
Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob
...
Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows
2023-08-10 11:53:48 +02:00
Giambattista Bloisi
fab9920271
DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag
2023-08-09 15:41:43 +02:00
Miriam Baglioni
c25ac21e5e
Merge pull request 'graph cleaning, suggestions from ticket 8898' ( #325 ) from cleaning_8898 into beta
...
Reviewed-on: D-Net/dnet-hadoop#325
2023-08-08 11:14:19 +02:00
Claudio Atzori
11ffb9bd68
rule out records with NULL dataInfo
2023-07-31 12:35:33 +02:00
Claudio Atzori
270df939c4
partial implementation of the suggestions from https://support.openaire.eu/issues/8898
2023-07-25 17:29:50 +02:00
Giambattista Bloisi
e64c2854a3
Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
...
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Giambattista Bloisi
bb5b845e3c
Use scala.binary.version property to resolve scala maven dependencies
...
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
Claudio Atzori
b76a47b103
[aggregator graph] added column alias when mapping organization PIDs from the OpenOrgs database
2023-06-13 11:38:10 +02:00
Claudio Atzori
ad04f14b81
Merge branch 'beta' into distinct_pids_from_openorgs_beta
2023-06-12 09:58:21 +02:00
Claudio Atzori
e1409ffe80
update sql query to return distinct pids
2023-06-12 09:47:45 +02:00
Claudio Atzori
e45777e7e1
[aggregator graph] added validation for URLs mapped from oaf:fulltext
2023-05-26 11:33:42 +02:00
Claudio Atzori
8acad52a0c
Merge branch 'beta' into apc_affiliation
2023-05-15 15:47:33 +02:00
Claudio Atzori
8a463cc3e8
fixed organization id created when mapping APC affiliations. Factored out ROR constants in dhp-common
2023-05-15 15:44:46 +02:00
Miriam Baglioni
99ac5bab46
added check to avoid NPE when checking the organization country
2023-05-04 19:38:39 +02:00
Claudio Atzori
d8882c4481
extended mapping applied to datacite records to produce affiliations using the ROR ids. In case of APCs it includes the amount and the currency in the relation
2023-05-02 11:56:51 +02:00
Claudio Atzori
851f664bd9
Merge branch 'beta' into graph_cleaning_refactoring
2023-05-02 09:55:40 +02:00
Claudio Atzori
a2dcb06daf
added eoscifguidelines in the result view; removed compute statistics statements
2023-04-11 10:43:32 +02:00
Claudio Atzori
864f4051d3
[graph cleaning] added missing case
2023-04-05 11:35:47 +02:00
Claudio Atzori
dead87917f
[graph cleaning] cleanup
2023-04-04 13:13:43 +02:00
Claudio Atzori
2a6ba29b64
[graph cleaning] unit tests & cleanup
2023-04-04 12:34:51 +02:00
Claudio Atzori
b502f86523
fixed input path supplied to GetDatasourceFromCountry; adjusted the various spark.sql.shuffle.partitions
2023-03-24 13:09:12 +01:00
Claudio Atzori
c07857fa37
[graph cleaning] unit tests & cleanup
2023-03-23 15:57:47 +01:00
Claudio Atzori
90e61a8aba
[graph cleaning] WIP: refactoring of the cleaning stages, unit tests
2023-03-23 15:03:26 +01:00
Claudio Atzori
488d9a5eaa
[graph cleaning] WIP: refactoring of the cleaning stages, unit tests
2023-03-23 10:41:13 +01:00
Claudio Atzori
4f5ba0ed52
[graph cleaning] WIP: refactoring of the cleaning stages, unit tests
2023-03-21 14:41:20 +01:00
Claudio Atzori
6d3d18d8b5
[graph cleaning] WIP: refactoring of the cleaning stages
2023-03-16 17:23:36 +01:00
Claudio Atzori
518618f1a9
[graph cleaning] avoid overwriting the subject class to 'keyword' for those with provenance 'subject:fos'
2023-03-14 15:22:47 +01:00
Claudio Atzori
e28d395e87
[aggregator graph] using dedicated path to sync claims, adjusted paths with wildcards
2023-03-08 21:16:52 +01:00
Claudio Atzori
5b8fd37314
[aggregator graph] using dedicated path to sync claims
2023-03-08 15:28:14 +01:00
Claudio Atzori
7fd89566c2
[aggregator graph] handle paths including wildcards
2023-03-08 12:43:00 +01:00
Claudio Atzori
8ec0d62d91
pre-group the records in each table before joining the contents from BETA and PROD together
2023-03-02 14:49:19 +01:00
Claudio Atzori
6f488547a7
ignore non processable records
2023-03-01 14:49:51 +01:00
Claudio Atzori
7d263f265e
adjusted logs
2023-03-01 11:58:07 +01:00
Claudio Atzori
9c59dac859
followup changes reorganising the mdstore synchronisation mechanism
2023-03-01 10:16:20 +01:00
Sandro La Bruzzo
78e51c182a
Added missing parameter to raw all workflow
2023-02-28 10:16:01 +01:00
Michele Artini
fddcf701e9
updated the order of the compatibilities
2023-02-22 12:07:09 +01:00
Sandro La Bruzzo
8920932dd8
Code formatted
2023-02-08 11:34:18 +01:00
Sandro La Bruzzo
6c81a161d2
Merge remote-tracking branch 'origin/beta' into 8231-mdstore-synch-improve
2023-02-08 10:29:09 +01:00
Miriam Baglioni
d6895f0387
Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta
2023-01-09 17:28:38 +01:00
Sandro La Bruzzo
3c9826f186
updated the lines function to its implementation, linesWithSeparators.map(l => l.stripLineEnd); in this way we force the scala plugin compiler to treat this pipeline as scala code and not as a java.string.lines() pipeline
2022-12-21 11:21:17 +01:00
Miriam Baglioni
8685eaa706
[Clean Country] added test to verify removal of country
2022-12-16 15:31:25 +01:00
Miriam Baglioni
dc0ec88a58
Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta
2022-12-16 13:18:32 +01:00
Miriam Baglioni
d791840b82
[Clean Country] added test to verify removal of country
2022-12-16 13:18:29 +01:00
Claudio Atzori
7b80b24f82
[cleaning] country cleaning must use both PID and AlternateIdentifier fields
2022-12-15 14:49:04 +01:00
Claudio Atzori
b8bafab8a0
[cleaning] improved vocabulary based mapping, specialization for the strict vocab cleaning
2022-12-12 14:43:03 +01:00
Sandro La Bruzzo
5e4866d033
implemented synch for single mdstore
2022-12-12 11:29:46 +01:00
Claudio Atzori
c18b8048c3
[cleaning] avoid NPE
2022-12-10 11:41:38 +01:00
Claudio Atzori
8b44afe5e5
[cleaning] avoid NPE
2022-12-09 15:44:57 +01:00