Claudio Atzori
03670bb9ce
[dedup] use common saveParquet and save methods to ensure outputs are compressed
2023-10-16 10:55:47 +02:00
Claudio Atzori
eed9fe0902
code formatting
2023-10-06 12:31:17 +02:00
Claudio Atzori
7b403a920f
Merge branch 'beta' into consistency_keep_mergerels
2023-10-02 11:26:00 +02:00
Giambattista Bloisi
2caaaec42d
Include SparkCleanRelation logic in SparkPropagateRelation
...
SparkPropagateRelation includes merge relations
Revised tests for SparkPropagateRelation
2023-09-04 11:33:20 +02:00
Giambattista Bloisi
6cc7d8ca7b
GroupEntities and DispatchEntities are now merged into GroupEntitiesSparkJob
2023-08-30 10:43:31 +02:00
Giambattista Bloisi
6b1c05d118
Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set, it defaults to 1Gb
2023-08-29 16:04:19 +02:00
Claudio Atzori
bf35280ea6
code formatting
2023-08-29 11:11:00 +02:00
Claudio Atzori
58665a246c
Merge branch 'beta' into propagate_relation_rewrite
2023-08-29 10:47:02 +02:00
Giambattista Bloisi
d012aec0b3
Revert PropagateRelation's argument name from outputPath to graphOutputPath in consistency workflow (#8964)
2023-08-28 22:44:54 +02:00
Giambattista Bloisi
a860e19423
Fix: ensure all relations are written out, not only those managed by dedup
2023-08-28 15:36:02 +02:00
Giambattista Bloisi
0d7b2bf83d
Rewrite SparkPropagateRelation using the DataFrame API
2023-08-28 10:34:54 +02:00
Giambattista Bloisi
95cd2b9b1e
Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob
...
Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows
2023-08-10 11:53:48 +02:00
Giambattista Bloisi
fab9920271
DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag
2023-08-09 15:41:43 +02:00
Giambattista Bloisi
97b6d1dc45
Filter ids by dataInfo.deletedbyinference and DataInfo.invisible flags
...
Filter relations also by dataInfo.invisible flag
2023-08-07 10:24:11 +02:00
Giambattista Bloisi
af49424b59
Add a "CleanRelation" action after PropagateRelation to filter out all relations that have been deleted by inference or that point to dangling entities
2023-08-04 14:27:39 +02:00
Giambattista Bloisi
e64c2854a3
Refactor Dedup process to use the Spark DataFrame API and an intermediate representation based on the Row interface
...
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Giambattista Bloisi
5e15f20e6e
Fix entityMerger that was excluding the authors of the first entity in the list to merge
2023-07-21 00:46:54 +02:00
Sandro La Bruzzo
9963fd6d29
updated log to add subentity
2023-06-28 13:36:05 +02:00
Sandro La Bruzzo
ed7e2ab6d1
reverted mistake on commit workflow.xml
2023-06-28 11:40:19 +02:00
Sandro La Bruzzo
9910ce06ae
added time logging to CreateSimRel
2023-06-28 11:38:16 +02:00
Sandro La Bruzzo
bd17c3edc8
added time logging to CreateSimRel
2023-06-28 11:20:58 +02:00
Claudio Atzori
909729a2fc
[dedup] tweaking num partitions, minor changes
2023-05-17 10:16:22 +02:00
Claudio Atzori
062abfd669
fixed NPE, removed unused stuff
2022-12-06 12:04:00 +01:00
Claudio Atzori
0aa725083f
extended dedup testing
2022-11-17 16:13:43 +01:00
Claudio Atzori
ddff0e8999
merging duplicates using IdentifierComparator
2022-11-11 16:10:25 +01:00
Claudio Atzori
5af5a8ae42
added IdentifierComparator
2022-11-09 14:20:59 +01:00
Claudio Atzori
61319b2e83
updated dhp-schema version; set entity-level dataInfo before & after merging the fields from the group of duplicates
2022-03-25 16:38:33 +01:00
miconis
8991d097b4
bug fix in the DedupRecordFactory, DataInfo set before merge
2022-02-24 17:13:12 +01:00
Claudio Atzori
391aa1373b
added unit test
2022-01-19 17:13:21 +01:00
Claudio Atzori
44a937f4ed
factored out entity grouping implementation, extended to consider results from delegated authorities rather than identical records from other sources
2022-01-19 12:24:52 +01:00
Claudio Atzori
2b46b87f56
fixed filtering criteria applied in SparkCopyRelationsNoOpenorgs to keep the parent/child relations from OpenOrgs
2021-11-19 11:30:29 +01:00
Claudio Atzori
a24b9f8268
[dedup] trivial refactoring
2021-11-18 17:12:02 +01:00
Claudio Atzori
c0750fb17c
avoid unnecessary count operations over large spark datasets
2021-11-18 17:11:31 +01:00
Claudio Atzori
0a727d325d
[dedup] increased number of partitions in the consistency phase
2021-11-16 08:43:41 +01:00
miconis
611ca511db
set configuration property in openorgs duplicates wf
2021-10-07 15:39:55 +02:00
miconis
9646b9fd98
implementation of the http call for the update of openorgs suggestions
2021-10-07 11:29:11 +02:00
miconis
853333bdde
implementation of the whitelist for similarity relations
2021-09-20 16:21:47 +02:00
Claudio Atzori
9f4db73f30
updated/fixed unit tests
2021-08-11 15:02:51 +02:00
Claudio Atzori
2ee21da43b
suggestions from SonarLint
2021-08-11 12:13:22 +02:00
Claudio Atzori
2fff24df55
code formatting
2021-07-28 11:34:19 +02:00
Sandro La Bruzzo
3920c69bc8
change implementation of resolve Relation to generate jsonRdd in output
2021-07-25 09:51:36 +02:00
Sandro La Bruzzo
058b636d4d
added control to check if the entity exists
2021-07-22 16:08:54 +02:00
Claudio Atzori
41b551562e
applying PR#115 (DatePicker) on stable_ids
2021-06-17 09:33:50 +02:00
Claudio Atzori
23b8883ab1
applied intellij code cleanup
2021-05-14 10:58:12 +02:00
Claudio Atzori
5afa7d3e0c
core utilities in dhp-common moved in external module dhp-schemas
2021-04-27 15:44:01 +02:00
Claudio Atzori
ef4bfd82e2
code formatting
2021-04-27 10:09:31 +02:00
miconis
3c12eeadce
bug fix in propagation of relations
2021-04-22 11:44:33 +02:00
Claudio Atzori
8f309b72ff
[dedup] using node names consistently across the workflow
2021-04-21 17:54:51 +02:00
Claudio Atzori
815b9f4d56
[openorgs dedup] fixed workflow parameter declarations. Introduced support for resuming the execution from intermediate steps
2021-04-20 17:24:45 +02:00
Claudio Atzori
45057440c1
code formatting
2021-04-16 17:28:25 +02:00