Commit Graph

6 Commits

Author SHA1 Message Date
Claudio Atzori 03670bb9ce [dedup] use common saveParquet and save methods to ensure outputs are compressed 2023-10-16 10:55:47 +02:00
Giambattista Bloisi e64c2854a3 Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Claudio Atzori 909729a2fc [dedup] tweaking num partitions, minor changes 2023-05-17 10:16:22 +02:00
Claudio Atzori a24b9f8268 [dedup] trivial refactoring 2021-11-18 17:12:02 +01:00
miconis 611ca511db set configuration property in openorgs duplicates wf 2021-10-07 15:39:55 +02:00
miconis 853333bdde implementation of the whitelist for similarity relations 2021-09-20 16:21:47 +02:00