giambattista.bloisi
created branch dispatch_filter_invisible_entities in D-Net/dnet-hadoop
2023-08-09 15:46:25 +02:00
giambattista.bloisi
deleted branch cleanup_relations_after_dedup from D-Net/dnet-hadoop
2023-08-09 15:45:51 +02:00
giambattista.bloisi
pushed to cleanup_relations_after_dedup at D-Net/dnet-hadoop
2023-08-07 10:24:30 +02:00
97b6d1dc45
Filter ids by dataInfo.deletedbyinference and dataInfo.invisible flags
Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities
giambattista.bloisi
created branch cleanup_relations_after_dedup in D-Net/dnet-hadoop
2023-08-04 17:25:24 +02:00
giambattista.bloisi
pushed to cleanup_relations_after_dedup at D-Net/dnet-hadoop
2023-08-04 17:25:24 +02:00
af49424b59
Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities
WIP Changes in maven poms to build and test the project using Spark 3.4.x and scala 2.12
giambattista.bloisi
created branch spark34-integration in D-Net/dnet-hadoop
2023-08-02 18:06:15 +02:00
Import affiliation relations from Crossref
This class can be removed by using the DataFrame API approach
Import affiliation relations from Crossref
It is advisable to compress the output file here (using /data/bip-affiliations/data.json as the input, the total disk size of the output file is reduced from 50 GB to 1.5 GB)
Import affiliation relations from Crossref
That class can be removed by using the DataFrame API approach
Import affiliation relations from Crossref
AffiliationRelationDeserializer and AffiliationRelationModel are two classes used to store an intermediate representation of the data that is eventually put into the generated Relation(s). Those two classes leverage Lombok annotations to get a few methods generated automatically.
Refactor Dedup using Spark Dataframe API, initial support for scala 2.12 and Spark 3.4
giambattista.bloisi
pushed to dedup-with-dataframe-2 at D-Net/dnet-hadoop
2023-07-24 15:37:01 +02:00
e64c2854a3
Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
giambattista.bloisi
pushed to dedup-with-dataframe-2 at D-Net/dnet-hadoop
2023-07-24 11:14:19 +02:00
45ed6e6229
Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
bb5b845e3c
Use scala.binary.version property to resolve scala maven dependencies
002b24e06f
Merge pull request '[graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests' (#315) from pid_cleaning into beta
c754397a19
Merge branch 'beta' into pid_cleaning
f0678cda09
Merge pull request 'fix_beta_tests' (#323) from fix_beta_tests into beta
giambattista.bloisi
pushed to dedup-with-dataframe-2 at D-Net/dnet-hadoop
2023-07-21 13:46:33 +02:00
b21a1107ae
Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
giambattista.bloisi
pushed to dedup-with-dataframe-2 at D-Net/dnet-hadoop
2023-07-21 12:23:32 +02:00
587ca0e44d
Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
f03153823a
Update testCitationRelations number of expected citations according to changes made in 0559d8b4 (monodirectional citations)
54c1eacef1
SparkJobTest was failing because the testing workingdir was not cleaned up after each test
5e15f20e6e
Fix entityMerger that was excluding the authors of the first entity in the list to merge
0210a14e43
Ignore timestamp differences in PromoteActionPayloadForGraphTableJobTest