Refactor Dedup using Spark Dataframe API, initial support for scala 2.12 and Spark 3.4 #324

giambattista.bloisi · 2023-07-24T15:51:10+02:00

giambattista.bloisi commented

2023-07-24 15:51:10 +02:00

Refactor SparkCreateSimRels process to use Spark Dataframe API:

Remove MapDocument class and represent entities with Spark API Row interface
Deterministic order of elements in a clustering by sorting by "order" and "identifier" fields
Do not emit inverse relations; duplicates are avoided by ordering the from and to identifiers (from is lexicographically smaller than to)
minor performance improvements when evaluating similarity
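
The two ordering rules above can be sketched in plain Java, without Spark. This is only an illustration under assumptions: `SimRelOrdering`, `Entity` and `canonicalPair` are hypothetical names that do not come from this PR, and `Entity` stands in for a Spark `Row` carrying the "order" and "identifier" fields as strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SimRelOrdering {

    /** Hypothetical stand-in for a Row with "identifier" and "order" fields. */
    public static final class Entity {
        public final String identifier;
        public final String order;

        public Entity(String identifier, String order) {
            this.identifier = identifier;
            this.order = order;
        }
    }

    /**
     * Sort by "order" first and use "identifier" as a tie-breaker, so that
     * entities sharing the same "order" value always come out in the same sequence.
     */
    public static final Comparator<Entity> DETERMINISTIC =
            Comparator.comparing((Entity e) -> e.order)
                    .thenComparing(e -> e.identifier);

    /**
     * Return the pair with the lexicographically smaller identifier first.
     * (a, b) and (b, a) map to the same pair, so the inverse relation is never emitted.
     */
    public static String[] canonicalPair(String a, String b) {
        return a.compareTo(b) <= 0 ? new String[] {a, b} : new String[] {b, a};
    }

    public static void main(String[] args) {
        List<Entity> block = new ArrayList<>(Arrays.asList(
                new Entity("id3", "title a"),
                new Entity("id1", "title a"),
                new Entity("id2", "title b")));
        block.sort(DETERMINISTIC);
        for (Entity e : block) {
            System.out.println(e.order + " / " + e.identifier);
        }
        System.out.println(Arrays.toString(canonicalPair("id2", "id1")));
    }
}
```

Sorting on the "order" field alone leaves ties in whatever sequence the input happened to arrive in, which is why the second sort key matters for reproducible results.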

Build for both Spark 2.4 with Scala 2.11 and Spark 3.4 with Scala 2.12 (WIP)

Use the scala.binary.version property to resolve Scala Maven dependencies:
Ensure consistent usage of Maven properties
"scala-2.12" profile for compiling with Scala 2.12 and Spark 3.4
Syntax changes to compile with both Scala 2.11 and Scala 2.12

Refactor SparkWhitelistSimRels to use the Dataframe API
Fix the JsonPath cache contention performance problem by using a ConcurrentHashMap
Improve blacklist filtering performance by using precompiled Patterns
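
The last two points can be illustrated with a minimal sketch that is not the code in this PR. `DedupCaches`, `CompiledCache` and `Blacklist` are hypothetical names. The cache is generic over the compiled value, so a compiled regex stands in for a compiled JsonPath expression here. Whether the real blacklist uses a full match or a partial match is an assumption; this sketch uses a full match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.regex.Pattern;

public class DedupCaches {

    /**
     * Thread-safe cache of compiled expressions. ConcurrentHashMap lets
     * concurrent readers proceed without taking a single global lock.
     */
    public static final class CompiledCache<V> {
        private final Map<String, V> cache = new ConcurrentHashMap<>();
        private final Function<String, V> compiler;

        public CompiledCache(Function<String, V> compiler) {
            this.compiler = compiler;
        }

        /** Compile the expression on first use, then reuse the cached value. */
        public V get(String expression) {
            return cache.computeIfAbsent(expression, compiler);
        }

        public int size() {
            return cache.size();
        }
    }

    /** Blacklist whose regexes are compiled once, not once per checked value. */
    public static final class Blacklist {
        private final List<Pattern> patterns = new ArrayList<>();

        public Blacklist(List<String> regexes) {
            for (String regex : regexes) {
                patterns.add(Pattern.compile(regex));
            }
        }

        /** Assumption: a value is blacklisted when a pattern matches it entirely. */
        public boolean isBlacklisted(String value) {
            for (Pattern p : patterns) {
                if (p.matcher(value).matches()) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        CompiledCache<Pattern> cache = new CompiledCache<>(Pattern::compile);
        // The second lookup returns the instance stored by the first one.
        System.out.println(cache.get("a+") == cache.get("a+"));

        Blacklist blacklist = new Blacklist(Arrays.asList("^test.*", "n/?a"));
        System.out.println(blacklist.isBlacklisted("test record"));
        System.out.println(blacklist.isBlacklisted("real title"));
    }
}
```

Calling `String.matches` inside a filter recompiles the regex on every call, so holding `Pattern` instances moves that cost to construction time.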


giambattista.bloisi added 2 commits 2023-07-24 15:51:11 +02:00

bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies

Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4

e64c2854a3 Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface

JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)

giambattista.bloisi requested review from sandro.labruzzo 2023-07-24 15:51:42 +02:00

giambattista.bloisi requested review from claudio.atzori 2023-07-24 15:51:42 +02:00

claudio.atzori merged commit 8c63e4a864 into beta

2023-07-25 10:17:18 +02:00

claudio.atzori deleted branch dedup-with-dataframe-2

2023-07-25 10:17:18 +02:00

claudio.atzori referenced this issue from a commit

2023-07-25 10:17:19 +02:00

Merge pull request 'Refactor Dedup using Spark Dataframe API, initial support for scala 2.12 and Spark 3.4' (#324) from dedup-with-dataframe-2 into beta

claudio.atzori referenced this pull request

2023-09-04 16:40:25 +02:00

Master branch updates from beta September 2023 #337