Claudio Atzori
8c63e4a864
Merge pull request 'Refactor Dedup using Spark Dataframe API, initial support for scala 2.12 and Spark 3.4' ( #324 ) from dedup-with-dataframe-2 into beta
...
Reviewed-on: D-Net/dnet-hadoop#324
2023-07-25 10:17:17 +02:00
Giambattista Bloisi
e64c2854a3
Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
...
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Giambattista Bloisi
bb5b845e3c
Use scala.binary.version property to resolve scala maven dependencies
...
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
Claudio Atzori
002b24e06f
Merge pull request '[graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests' ( #315 ) from pid_cleaning into beta
...
Reviewed-on: D-Net/dnet-hadoop#315
2023-07-24 10:49:44 +02:00
Claudio Atzori
c754397a19
Merge branch 'beta' into pid_cleaning
2023-07-24 10:49:31 +02:00
Claudio Atzori
f0678cda09
Merge pull request 'fix_beta_tests' ( #323 ) from fix_beta_tests into beta
...
Reviewed-on: D-Net/dnet-hadoop#323
2023-07-24 10:47:35 +02:00
Giambattista Bloisi
f03153823a
Update testCitationRelations number of expected citations according to changes made in 0559d8b4
(monodirectional citations)
2023-07-21 10:48:28 +02:00
Giambattista Bloisi
54c1eacef1
SparkJobTest was failing because testing workingdir was not cleaned up after eact test
2023-07-21 10:42:24 +02:00
Giambattista Bloisi
5e15f20e6e
Fix entityMerger that was excluding the authors of the first entity in the list to merge
2023-07-21 00:46:54 +02:00
Giambattista Bloisi
0210a14e43
Ignore timestamp differences in PromoteActionPayloadForGraphTableJobTest
2023-07-20 23:45:57 +02:00
Giambattista Bloisi
dba34505de
Fix SparkStatsTest bug where parquet tables were incorrectly read as text files leading to unpredictable count() values
2023-07-19 14:24:52 +02:00
Giambattista Bloisi
e47ed1fdb2
Use DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES in json mapper to avoid that tests fail if they encounter unmapped properties
2023-07-19 14:21:40 +02:00
Giambattista Bloisi
38dfebfbe6
Disable MdStoreClientTest test as it requires a local mongodb running and it does not perform any assertions
2023-07-19 14:18:56 +02:00
Giambattista Bloisi
ef493681d9
Merge pull request 'Import dnet-pace-core module in this project and use it after renaming to dhp-pace-core' ( #319 ) from beta_with_pace_core into beta
...
Reviewed-on: D-Net/dnet-hadoop#319
2023-07-11 14:03:15 +02:00
Giambattista Bloisi
801da2fd4a
New sources formatted by maven plugin
2023-07-06 10:28:53 +02:00
Giambattista Bloisi
bd3fcf869a
rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules
2023-07-06 10:02:23 +02:00
Giambattista Bloisi
3b35db5fbd
Import dnet-pace-core module from dnet-dedup repository
2023-07-05 22:23:06 +02:00
Miriam Baglioni
7738372125
[UsageCount] fixed typo in attribute name for datasource table
2023-06-30 18:56:41 +02:00
Sandro La Bruzzo
9963fd6d29
updated log to add subentity
2023-06-28 13:36:05 +02:00
Sandro La Bruzzo
ed7e2ab6d1
reverted mistake on commit workflow.xml
2023-06-28 11:40:19 +02:00
Sandro La Bruzzo
9910ce06ae
added to CreateSimRel the feature to write time log
2023-06-28 11:38:16 +02:00
Miriam Baglioni
2717edafb7
Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta
2023-06-28 11:25:14 +02:00
Miriam Baglioni
2f04c9d149
[BulkTagging] fixing left over for test
2023-06-28 11:24:42 +02:00
Sandro La Bruzzo
bd17c3edc8
added to CreateSimRel the feature to write time log
2023-06-28 11:20:58 +02:00
Sandro La Bruzzo
b195da3a83
Added utility to write time logs during the deduplication phase
2023-06-28 11:20:09 +02:00
Claudio Atzori
0f5a819f44
[graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests
2023-06-23 16:10:49 +02:00
Michele Artini
88a1cbc37d
fixed a datasource id
2023-06-22 07:56:33 +02:00
Claudio Atzori
b0ebf56367
Merge pull request 'Update step15_5.sql' ( #314 ) from antonis.lempesis/dnet-hadoop:beta into beta
...
Reviewed-on: D-Net/dnet-hadoop#314
2023-06-21 10:33:22 +02:00
dimitrispie
2b6370eaee
Update step15_5.sql
...
Bug fix
2023-06-21 11:31:10 +03:00
Claudio Atzori
35e42a86ed
Merge pull request 'Update step15_5.sql' ( #313 ) from antonis.lempesis/dnet-hadoop:beta into beta
...
Reviewed-on: D-Net/dnet-hadoop#313
2023-06-21 10:26:16 +02:00
dimitrispie
74cb060bfe
Update step15_5.sql
...
Add "if not exists" clause
2023-06-21 11:24:06 +03:00
Claudio Atzori
85e016df17
Merge pull request 'Update step16-createIndicatorsTables.sql' ( #312 ) from antonis.lempesis/dnet-hadoop:beta into beta
...
Reviewed-on: D-Net/dnet-hadoop#312
2023-06-21 09:52:33 +02:00
dimitrispie
a475cfcb7b
Update step16-createIndicatorsTables.sql
...
Rename a field in indi_pub_interdisciplinarity
2023-06-21 10:42:02 +03:00
Claudio Atzori
979cf9cd87
Merge pull request 'Update step15.sql' ( #311 ) from antonis.lempesis/dnet-hadoop:beta into beta
...
Reviewed-on: D-Net/dnet-hadoop#311
2023-06-21 09:20:01 +02:00
dimitrispie
4648cd88d4
Update step15.sql
...
Cast score to double
2023-06-21 10:02:19 +03:00
dimitrispie
94d2573c77
Update step15.sql
...
Bug Fix
2023-06-21 09:22:39 +03:00
Claudio Atzori
0561362de2
Merge pull request 'Update step20-createMonitorDB_institutions.sql' ( #309 ) from antonis.lempesis/dnet-hadoop:beta into beta
...
Reviewed-on: D-Net/dnet-hadoop#309
2023-06-20 15:07:09 +02:00
Claudio Atzori
50d7dc0078
[graph enrichment] fixed projectOrganizationPath not being passed to the apply_resulttoorganization_propagation node
2023-06-19 15:42:44 +02:00
Claudio Atzori
fbd9bf704e
indent
2023-06-19 15:41:22 +02:00
Claudio Atzori
6210f6ee48
Merge pull request 'Precompile blacklists patterns before evaluating clustering criteria' ( #1 ) from optimized-clustering into master
...
Reviewed-on: D-Net/dnet-dedup#1
2023-06-19 12:43:49 +02:00
dimitrispie
be2caedb04
Update step20-createMonitorDB_institutions.sql
...
Add openorgs____::1624ff7c01bb641b91f4518539a0c28a Vrije Universiteit Amsterdam
2023-06-19 12:12:17 +03:00
dimitrispie
36e0a8fec4
Changes to Promotion Stats WF
...
1. Add new cluster host at impala-shell commands
2. Add a step for splitting monitor dbs
3. Update workflow.xml to included the new splitting monitor dbs step
2023-06-19 09:44:34 +03:00
Giambattista Bloisi
b0ade43608
Precompile blacklists patterns before evaluating clustering criteria
...
Enable Junit 5 tests in maven builds
Make path comparisons platform-independent
Read String resource files assuming they are encoded in UTF-8
Fix a few test conditions
2023-06-16 09:41:11 +02:00
dimitrispie
4c770a5e29
Update finalizeImpalaCluster.sh
...
Drop views in shadow dbs before dropping the db
2023-06-15 13:25:37 +03:00
dimitrispie
e06d962a6a
Update step15.sql
2023-06-15 12:20:35 +03:00
dimitrispie
afcad08396
Update step20-createMonitorDB_institutions.sql
...
Added openorgs____::c0b262bd6eab819e4c994914f9c010e2 -- National Institute of Geophysics and Volcanology
2023-06-15 10:28:49 +03:00
Claudio Atzori
b9748763e2
Merge pull request '[stats wf] Bug fixes' ( #308 ) from antonis.lempesis/dnet-hadoop:beta into beta
...
Reviewed-on: D-Net/dnet-hadoop#308
2023-06-14 21:57:03 +02:00
dimitrispie
42b8ce2ba4
Update copyDataToImpalaCluster.sh
2023-06-14 19:23:42 +03:00
dimitrispie
2032b0df40
Bug fixes
...
1. Remove tables/views from old databases in the new cluster, before dropping the dbs
2. Fix id in result_accessroute, indi_impact_measures, indi_pub_bronze_oa
2023-06-14 19:09:09 +03:00
Claudio Atzori
b76a47b103
[aggregator graph] added column alias when mapping organization PIDs from the OpenOrgs database
2023-06-13 11:38:10 +02:00