Refactor Dedup using Spark Dataframe API, initial support for scala 2.12 and Spark 3.4 #324

Merged
claudio.atzori merged 2 commits from dedup-with-dataframe-2 into beta 2023-07-25 10:17:18 +02:00

Refactor SparkCreateSimRels process to use Spark Dataframe API:

  • Remove MapDocument class and represent entities with Spark API Row interface
  • Deterministic order of elements in a clustering by sorting by "order" and "identifier" fields
  • Do not emit inverse relations; duplicates are avoided by ordering the from and to identifiers so that from is lexicographically smaller than to
  • Minor performance improvements when evaluating similarity
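
The two ordering rules above can be sketched in plain Java, without Spark. This is an illustration only, not the code in this PR: Entity, Relation and SimRelOrdering are hypothetical names, and the real implementation works on Spark Row objects. It assumes "order" is an integer field and "identifier" is a string.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; the PR itself represents entities as Spark Rows.
final class SimRelOrdering {

    // Stand-in for a clustered entity with the two sort fields named in the PR.
    static final class Entity {
        final String identifier;
        final int order;

        Entity(String identifier, int order) {
            this.identifier = identifier;
            this.order = order;
        }
    }

    // Stand-in for a similarity relation between two identifiers.
    static final class Relation {
        final String from;
        final String to;

        Relation(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    // Sort by "order" first, then by "identifier" to break ties,
    // so that equal "order" values always come out in the same sequence.
    static final Comparator<Entity> DETERMINISTIC =
            Comparator.comparingInt((Entity e) -> e.order)
                      .thenComparing(e -> e.identifier);

    // Returns a sorted copy, leaving the input list untouched.
    static List<Entity> sorted(List<Entity> block) {
        List<Entity> copy = new ArrayList<>(block);
        copy.sort(DETERMINISTIC);
        return copy;
    }

    // Always put the lexicographically smaller identifier in "from",
    // so (a, b) and (b, a) collapse into the same relation.
    static Relation canonical(String a, String b) {
        return a.compareTo(b) <= 0 ? new Relation(a, b) : new Relation(b, a);
    }
}
```

With this convention, canonical("b", "a") and canonical("a", "b") yield the same pair, so no inverse relation has to be emitted.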

Build for both Spark 2.4 and Scala 2.11, and Spark 3.4 and Scala 2.12 (WIP)

  • Use scala.binary.version property to resolve scala maven dependencies
  • Ensure consistent usage of maven properties
  • "scala-2.12" profile for compiling with scala 2.12 and Spark 3.4
  • Syntax changes to compile with both scala 2.11 and scala 2.12

Refactor SparkWhitelistSimRels with dataframe
JsonPath cache contention performance problem fixed by using a ConcurrentHashMap
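
As a rough sketch of why a ConcurrentHashMap helps here: a cache guarded by a single lock serializes all threads, while ConcurrentHashMap lets readers of already-cached keys proceed without blocking each other. The class below is a generic illustration under that assumption; CompiledPathCache and its compiler function are hypothetical names and do not reproduce the JsonPath cache interface or the code in this PR.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical cache from a path expression string to its compiled form.
final class CompiledPathCache<V> {

    private final ConcurrentMap<String, V> cache = new ConcurrentHashMap<>();

    // Assumed to be an expensive, side-effect-free "compile" step.
    private final Function<String, V> compiler;

    CompiledPathCache(Function<String, V> compiler) {
        this.compiler = compiler;
    }

    // computeIfAbsent runs the compiler at most once per key;
    // lookups of keys already present do not take a global lock.
    V get(String expression) {
        return cache.computeIfAbsent(expression, compiler);
    }

    int size() {
        return cache.size();
    }
}
```

A caller would create one shared instance per executor and call get for every record, so each distinct expression is compiled once.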
Blacklist filtering performance improvement using precompiled Patterns
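
The gain from precompiled Patterns comes from the fact that String.matches(regex) compiles the regular expression on every call. A minimal sketch of the alternative, assuming the blacklist is a list of regex strings and that a value is rejected when any pattern matches it in full; BlacklistFilter is a hypothetical name, not the class touched by this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical filter: compiles each blacklist regex once, up front.
final class BlacklistFilter {

    private final List<Pattern> patterns;

    BlacklistFilter(List<String> regexes) {
        List<Pattern> compiled = new ArrayList<>(regexes.size());
        for (String regex : regexes) {
            compiled.add(Pattern.compile(regex));
        }
        this.patterns = compiled;
    }

    // Full-match semantics, the same as String.matches, but without
    // recompiling the regex for every value that is checked.
    boolean isBlacklisted(String value) {
        for (Pattern p : patterns) {
            if (p.matcher(value).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

The filter is built once and reused for every value, so the compilation cost is paid per pattern rather than per comparison.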

giambattista.bloisi added 2 commits 2023-07-24 15:51:11 +02:00
bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
e64c2854a3 Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
giambattista.bloisi requested review from sandro.labruzzo 2023-07-24 15:51:42 +02:00
giambattista.bloisi requested review from claudio.atzori 2023-07-24 15:51:42 +02:00
claudio.atzori merged commit 8c63e4a864 into beta 2023-07-25 10:17:18 +02:00
claudio.atzori deleted branch dedup-with-dataframe-2 2023-07-25 10:17:18 +02:00