Master branch updates from beta January 2024 #385

Manually merged
claudio.atzori merged 0 commits from beta into master 2024-01-26 16:09:14 +01:00
  • #367 - Improvements and refactoring in Dedup
  • #373 - Graph enrichment workflows implemented as a single step
  • #374 - Refined mapping for the extraction of the original resource type used in the context of the COAR resource type attribution
  • #375 - [FoS integration] fix issue on FoS integration: removing the null values from FoS
  • #379 - [Graph provision] Use the context API instead of the ISLookUp service
  • #380 - [Graph provision] updated param specification for the XML converter job
  • #383 - [DOIBoost] Fixed problem on missing author in Crossref Mapping
  • 3e96777cc4 [metadata collection] increased logging from the oai-pmh metadata collection process
claudio.atzori added 63 commits 2024-01-26 16:03:33 +01:00
02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching the cluster key, instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use a reduced-complexity Levenshtein with threshold in levensteinTitle
- Use a reduced-complexity AuthorsMatch with an early-quit threshold
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use the new clustering configurations in the Dedup tests
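The "reduced complexity Levenshtein with threshold" mentioned above can be sketched as follows. This is a minimal, hypothetical illustration of the technique, not the actual dnet-hadoop comparator: once every value in a dynamic-programming row exceeds the threshold, the final distance can no longer come back under it, so the comparison quits early instead of completing the full matrix.

```java
// Hypothetical sketch of a threshold-limited Levenshtein distance,
// illustrating the early-quit idea behind comparators like levensteinTitle.
public class ThresholdLevenshtein {

    /** Returns the edit distance, or -1 if it provably exceeds {@code threshold}. */
    public static int distance(String a, String b, int threshold) {
        // The distance is at least the length difference, so quit immediately.
        if (Math.abs(a.length() - b.length()) > threshold) return -1;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Early quit: values in later rows can only grow from the row minimum.
            if (rowMin > threshold) return -1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()] <= threshold ? prev[b.length()] : -1;
    }
}
```

For dedup, the -1 result is enough to reject a candidate pair without ever computing the exact distance of clearly dissimilar titles.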

SparkWhitelistSimRels: use left semi join for clarity and performance
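The semantics of the left semi join used here can be illustrated without Spark (names below are hypothetical): each left row survives if and only if its key appears on the right, and it is emitted at most once, unlike an inner join, which can duplicate rows and drags the right-hand columns into the output. In Spark this corresponds to the built-in `left_semi` join type.

```java
// Plain-Java sketch of left-semi-join semantics (hypothetical names).
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SemiJoinSketch {
    public static List<String> leftSemiJoin(List<String> leftKeys, Set<String> rightKeys) {
        return leftKeys.stream()
            .filter(rightKeys::contains) // a left row survives iff a match exists
            .collect(Collectors.toList());
    }
}
```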

SparkCreateMergeRels:
- Use a new connected components algorithm that converges faster than the one provided by Spark GraphX
- Refactored to use window-based sorting rather than groupBy to reduce memory pressure
- Use a historical pivot table to generate singleton rels and merged rels, keeping continuity with the dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as the filler for missing or incorrect dates instead of "2000-01-01"
- Changed the generation of ids of type dedup_wf_001 to avoid collisions
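The connected components computation referenced in both SparkCreateSimRels and SparkCreateMergeRels treats sim rels as edges and turns each component of the similarity graph into one dedup group. As a hypothetical, non-distributed sketch of the idea (not the actual dnet-hadoop implementation), a union-find structure does the same grouping; label-propagation variants of it converge in far fewer rounds than a generic Pregel-style iteration.

```java
// Hypothetical union-find sketch: each sim rel merges two records'
// components; find() returns the representative of a record's dedup group.
import java.util.HashMap;
import java.util.Map;

public class UnionFind {
    private static final Map<String, String> parent = new HashMap<>();

    public static String find(String x) {
        parent.putIfAbsent(x, x);
        String p = parent.get(x);
        if (!p.equals(x)) {
            p = find(p);
            parent.put(x, p); // path compression keeps lookup chains short
        }
        return p;
    }

    public static void union(String a, String b) {
        String ra = find(a), rb = find(b);
        if (!ra.equals(rb)) parent.put(rb, ra); // merge the two components
    }
}
```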

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
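The memory argument behind the reduceGroups change can be shown outside Spark (class and merge logic below are hypothetical): a mapGroups-style step receives the whole group and tends to materialize it before merging, whereas a reduce-style step folds records two at a time, so only one partial result plus one incoming record needs to be held.

```java
// Illustrative, non-Spark contrast behind the DedupRecordFactory change.
import java.util.List;

public class PairwiseMerge {
    /** Fold a group into one dedup record by repeated two-way merges. */
    public static String mergePairwise(List<String> group) {
        String acc = "";
        for (String record : group) {
            acc = merge(acc, record); // only acc + one record in memory at a time
        }
        return acc;
    }

    // Hypothetical two-way merge: keep the richer (here: longer) record.
    static String merge(String a, String b) {
        return a.length() >= b.length() ? a : b;
    }
}
```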
3c66e3bd7b Create dedup record for "merged" pivots
Do not create dedup records for groups that have more than 20 different acceptance dates
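The "tomorrow as filler" rule from the pivot comparator above can be sketched like this (a minimal, hypothetical sketch, assuming ISO-formatted dates): a missing or unparsable acceptance date is replaced with tomorrow's date, so such records sort after every record with a real date instead of being pinned to an artificial past date like "2000-01-01".

```java
// Hypothetical sketch of the date filler used when ranking pivot candidates.
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

public class PivotDateFiller {
    /** Parse an ISO date, falling back to "tomorrow" for missing/bad values. */
    public static LocalDate parseOrTomorrow(String raw) {
        try {
            return LocalDate.parse(raw); // expects yyyy-MM-dd
        } catch (DateTimeParseException | NullPointerException e) {
            return LocalDate.now().plusDays(1); // sorts after every real date
        }
    }
}
```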
claudio.atzori manually merged commit 4d0c59669b into master 2024-01-26 16:09:14 +01:00
Reference: D-Net/dnet-hadoop#385