Refactor Dedup using Spark DataFrame API, initial support for Scala 2.12 and Spark 3.4 #324

Merged
claudio.atzori merged 2 commits from dedup-with-dataframe-2 into beta 2023-07-25 10:17:18 +02:00

2 Commits

Author SHA1 Message Date
Giambattista Bloisi e64c2854a3 Refactor Dedup process to use the Spark DataFrame API and an intermediate representation based on the Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting of clustered elements is now deterministic (by ordering field and identity field, instead of by ordering field only)
2023-07-24 15:36:24 +02:00
Giambattista Bloisi bb5b845e3c Use scala.binary.version property to resolve Scala Maven dependencies
Ensure consistent usage of Maven properties
Profile for compiling with Scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
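
The diff itself is not shown on this page, so the following is only a rough sketch of two ideas named in the first commit message: a cache backed by a ConcurrentHashMap, and a sort that breaks ties on the identity field so the result does not depend on input order. Every name here (DedupSketch, Doc, orderField, CompiledCache) is hypothetical and is not taken from the merged code; the cached value is a generic stand-in for a compiled JsonPath expression.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the merged code.
class DedupSketch {

    // Hypothetical record with an identity field and the field used for ordering.
    static final class Doc {
        final String id;
        final String orderField;

        Doc(String id, String orderField) {
            this.id = id;
            this.orderField = orderField;
        }
    }

    // Compare by ordering field first, then by identity field as a tie-breaker.
    // Records with equal ordering values then always sort the same way,
    // whatever order they arrived in.
    static final Comparator<Doc> DETERMINISTIC =
        Comparator.comparing((Doc d) -> d.orderField).thenComparing(d -> d.id);

    // Returns the ids of the given records in deterministic order.
    static List<String> sortedIds(List<Doc> docs) {
        List<Doc> copy = new ArrayList<>(docs);
        copy.sort(DETERMINISTIC);
        List<String> ids = new ArrayList<>();
        for (Doc d : copy) {
            ids.add(d.id);
        }
        return ids;
    }

    // Cache keyed by expression string. V stands in for a compiled expression
    // (assumption: the real cache holds compiled JsonPath objects).
    static final class CompiledCache<V> {
        private final Map<String, V> cache = new ConcurrentHashMap<>();
        private final Function<String, V> compiler;

        CompiledCache(Function<String, V> compiler) {
            this.compiler = compiler;
        }

        // computeIfAbsent is atomic per key on ConcurrentHashMap, so the
        // compiler function runs at most once for each distinct expression.
        V get(String expression) {
            return cache.computeIfAbsent(expression, compiler);
        }

        int size() {
            return cache.size();
        }
    }
}
```

With this comparator, two records sharing the same ordering value are always emitted in identity-field order, which is one way to get the reproducible clusters the commit message describes. The cache relies on ConcurrentHashMap locking per bin rather than on a single monitor around the whole map, which is the usual reason for choosing it when many threads read the same small set of keys.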