dnet-hadoop/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching the cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests
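
The "incremental Connected Component" item above can be illustrated with a small union-find sketch. This is not the BlockProcessor code from this commit: the class name `IncrementalComponents`, the `process` helper and the string-based similarity test are hypothetical, chosen only to show how tracking components while a block is scanned lets pairs that are already connected skip the expensive comparison.

```java
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical illustration; not the BlockProcessor implementation.
final class IncrementalComponents {
    private final int[] parent;

    IncrementalComponents(int size) {
        parent = new int[size];
        for (int i = 0; i < size; i++) parent[i] = i; // every record starts alone
    }

    // Find the component representative, halving paths on the way up.
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Merge the components containing a and b.
    void union(int a, int b) {
        parent[find(a)] = find(b);
    }

    // Visits every pair in the block but skips the similarity test when both
    // records already belong to the same component.
    // Returns how many times the similarity test was actually invoked.
    static <T> int process(List<T> block, BiPredicate<T, T> similar, IncrementalComponents cc) {
        int comparisons = 0;
        for (int i = 0; i < block.size(); i++) {
            for (int j = i + 1; j < block.size(); j++) {
                if (cc.find(i) == cc.find(j)) continue; // already known to be duplicates
                comparisons++;
                if (similar.test(block.get(i), block.get(j))) cc.union(i, j);
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        List<String> block = List.of("deep learning", "Deep Learning", "DEEP LEARNING", "graph theory");
        IncrementalComponents cc = new IncrementalComponents(block.size());
        int comparisons = process(block, String::equalsIgnoreCase, cc);
        // 4 records give 6 pairs; one pair is skipped because it is already connected.
        System.out.println("comparisons = " + comparisons);
    }
}
```

In this toy block the second and third records are linked through the first, so their direct comparison is never run. The saving grows with the size of the duplicate groups inside a block.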

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use a new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use window-based sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect dates instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions
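
The change of date filler in the pivot comparator can be sketched as follows. This is a minimal illustration, not the comparator from this commit: the `PivotDateSketch` and `Rec` classes, the ISO `yyyy-MM-dd` date format and the assumption that records with earlier dates are preferred as pivot are all hypothetical. Under that assumption, a "2000-01-01" filler makes undated records look older than most dated ones, whereas a "tomorrow" filler sorts them after every record with a valid date.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; names and ordering rules are assumptions.
final class PivotDateSketch {

    // Minimal stand-in for a candidate pivot record.
    static final class Rec {
        final String id;
        final String date; // may be null or malformed

        Rec(String id, String date) {
            this.id = id;
            this.date = date;
        }
    }

    // Parse an ISO date; fall back to the day after `today` when it is missing or invalid.
    static LocalDate dateOrTomorrow(String raw, LocalDate today) {
        if (raw == null || raw.trim().isEmpty()) return today.plusDays(1);
        try {
            return LocalDate.parse(raw.trim());
        } catch (DateTimeParseException e) {
            return today.plusDays(1);
        }
    }

    // Earliest date first (assumed preference); ties broken by id for determinism.
    static Comparator<Rec> byDateThenId(LocalDate today) {
        return Comparator
            .comparing((Rec r) -> dateOrTomorrow(r.date, today))
            .thenComparing(r -> r.id);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2024, 1, 10);
        List<Rec> candidates = new ArrayList<>();
        candidates.add(new Rec("b", null));
        candidates.add(new Rec("a", "2019-05-01"));
        candidates.add(new Rec("c", "01/02/2003")); // not ISO, treated as invalid
        candidates.sort(byDateThenId(today));
        for (Rec r : candidates) {
            System.out.println(r.id + " -> " + dateOrTomorrow(r.date, today));
        }
    }
}
```

`today` is passed in explicitly so the ordering is reproducible; a real job would more likely take it from the clock once per run.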

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
Name | Last commit | Date
model | SparkCreateSimRels: | 2024-01-10 22:59:52 +01:00
AbstractSparkAction.java | SparkCreateSimRels: | 2024-01-10 22:59:52 +01:00
DatePicker.java | Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface | 2023-07-24 15:36:24 +02:00
DedupRecordFactory.java | SparkCreateSimRels: | 2024-01-10 22:59:52 +01:00
DedupUtility.java | Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface | 2023-07-24 15:36:24 +02:00
IdGenerator.java | SparkCreateSimRels: | 2024-01-10 22:59:52 +01:00
IdentifierComparator.java | added IdentifierComparator | 2022-11-09 14:20:59 +01:00
SparkBlockStats.java | Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface | 2023-07-24 15:36:24 +02:00
SparkCopyOpenorgsMergeRels.java | [dedup] use common saveParquet and save methods to ensure outputs are compressed | 2023-10-16 10:55:47 +02:00
SparkCopyOpenorgsSimRels.java | suggestions from SonarLint | 2021-08-11 12:13:22 +02:00
SparkCopyRelationsNoOpenorgs.java | [dedup] use common saveParquet and save methods to ensure outputs are compressed | 2023-10-16 10:55:47 +02:00
SparkCreateDedupRecord.java | suggestions from SonarLint | 2021-08-11 12:13:22 +02:00
SparkCreateMergeRels.java | SparkCreateSimRels: | 2024-01-10 22:59:52 +01:00
SparkCreateOrgsDedupRecord.java | [dedup] use common saveParquet and save methods to ensure outputs are compressed | 2023-10-16 10:55:47 +02:00
SparkCreateSimRels.java | [dedup] use common saveParquet and save methods to ensure outputs are compressed | 2023-10-16 10:55:47 +02:00
SparkPrepareNewOrgs.java | suggestions from SonarLint | 2021-08-11 12:13:22 +02:00
SparkPrepareOrgRels.java | suggestions from SonarLint | 2021-08-11 12:13:22 +02:00
SparkPropagateRelation.java | code formatting | 2023-10-06 12:31:17 +02:00
SparkUpdateEntity.java | suggestions from SonarLint | 2021-08-11 12:13:22 +02:00
SparkWhitelistSimRels.java | SparkCreateSimRels: | 2024-01-10 22:59:52 +01:00
UpdateOpenorgsJob.java | set configuration property in openorgs duplicates wf | 2021-10-07 15:39:55 +02:00