Giambattista Bloisi
831cc1fdde
Generate "merged" dedup id relations also for records that are filtered out by the cut parameters
2024-01-10 22:59:52 +01:00
Giambattista Bloisi
1287315ffb
No longer use dedupId information from pivotHistory Database
2024-01-10 22:59:52 +01:00
Giambattista Bloisi
02636e802c
SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests
SparkWhitelistSimRels: use left semi join for clarity and performance
SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use window-based sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions
DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
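The "Levenshtein with threshold" item in the commit above can be illustrated with a minimal sketch. This is not the code from the commit: the function name `bounded_levenshtein` and the `-1` "over threshold" return value are assumptions made for illustration, and the actual implementation may additionally restrict the computation to a diagonal band. The sketch only shows the early-quit idea: stop as soon as no alignment can stay within the threshold.

```python
def bounded_levenshtein(a: str, b: str, max_dist: int) -> int:
    """Return the edit distance between a and b, or -1 if it exceeds max_dist.

    Hypothetical helper for illustration only; not the project's API.
    """
    if max_dist < 0:
        return -1
    # Keep the shorter string as the row dimension to limit memory use.
    if len(a) > len(b):
        a, b = b, a
    # The length difference is a lower bound on the edit distance.
    if len(b) - len(a) > max_dist:
        return -1

    previous = list(range(len(a) + 1))
    for i, ch_b in enumerate(b, start=1):
        current = [i]
        row_min = i
        for j, ch_a in enumerate(a, start=1):
            cost = 0 if ch_a == ch_b else 1
            value = min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            )
            current.append(value)
            row_min = min(row_min, value)
        # Every path to the final cell crosses this row, and values never
        # decrease along a path, so the row minimum bounds the final result.
        if row_min > max_dist:
            return -1
        previous = current

    return previous[-1] if previous[-1] <= max_dist else -1
```

With a small threshold, most dissimilar title pairs are rejected after a few rows instead of filling the whole matrix, which is where the saving comes from when many candidate pairs are compared within a block.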
Claudio Atzori
16d858fbf0
Merge pull request 'enrichmentSingleStep' ( #373 ) from enrichmentSingleStep into beta
Reviewed-on: D-Net/dnet-hadoop#373
2024-01-10 16:58:49 +01:00
Miriam Baglioni
e711a05229
fixed conflicts
2024-01-10 11:03:42 +01:00
Miriam Baglioni
71d6f30711
Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta
2024-01-10 10:59:58 +01:00
Miriam Baglioni
cb14470ba6
added properties file in the folder for the workflow of result to organization from inst repo propagation. Changed the path in the classes implementing the propagation
2023-12-22 14:50:05 +01:00
Miriam Baglioni
9f966b59d4
added properties file in the folder for the workflow of result to community from semrel propagation. Changed the path in the classes implementing the propagation
2023-12-22 14:11:47 +01:00
Miriam Baglioni
2f3b5a133d
added properties file in the folder for the workflow of result to community from organization propagation. Changed the path in the classes implementing the propagation
2023-12-22 13:56:40 +01:00
Miriam Baglioni
2f7b9ad815
added properties file in the folder for the workflow of project to result propagation. Changed the path in the classes implementing the propagation
2023-12-22 11:46:15 +01:00
Miriam Baglioni
f2352e8a78
changed in the classes the path for the property files for the propagation of community from project
2023-12-22 11:43:34 +01:00
Miriam Baglioni
009730b3d1
added properties file in the folder for the workflow of orcid propagation. Changed the path in the classes implementing the propagation. Changed the path to the parameter file in the class for entitytoorganization propagation
2023-12-22 11:42:09 +01:00
Miriam Baglioni
89f269c7f4
changed the path to the parameter file in the class for entitytoorganization propagation
2023-12-22 11:37:50 +01:00
Miriam Baglioni
b06aea0adf
added the bulkTag parameter file in the folder for the oozie workflow for bulkTagging. Changed the path in the class
2023-12-22 11:35:37 +01:00
Miriam Baglioni
3afd4aa57b
adjustments for country propagation
2023-12-22 11:27:30 +01:00
Claudio Atzori
62104790ae
added metaresourcetype to the result hive DB view
2023-12-21 12:27:10 +01:00
Claudio Atzori
106968adaa
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2023-12-21 12:26:29 +01:00
Claudio Atzori
a8a4db96f0
added metaresourcetype to the result hive DB view
2023-12-21 12:26:19 +01:00
Miriam Baglioni
5011c4d11a
refactoring after compilation
2023-12-20 15:57:26 +01:00
Miriam Baglioni
4740c808f7
-
2023-12-20 14:26:54 +01:00
Miriam Baglioni
d410ea8a41
added needed parameter
2023-12-19 12:15:01 +01:00
Sandro La Bruzzo
37e36baf76
updated workflow for generation of Scholix Datasources to use mdstore transactions
2023-12-18 16:05:35 +01:00
Sandro La Bruzzo
9d39845d1f
uploaded input parameters on CreateBaseline WF
2023-12-18 12:23:12 +01:00
Sandro La Bruzzo
15fd93a2b6
uploaded input parameters on CreateBaseline WF
2023-12-18 12:21:55 +01:00
Sandro La Bruzzo
9d342a47da
updated the transformation Baseline workflow to include mdstore rollback/commit action
2023-12-18 11:48:57 +01:00
Sandro La Bruzzo
1fbd4325f5
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2023-12-18 11:47:17 +01:00
Sandro La Bruzzo
1f1a6a5f5f
updated the transformation Baseline workflow to include mdstore rollback/commit action
2023-12-18 11:47:00 +01:00
Miriam Baglioni
3eca5d2e1c
-
2023-12-18 09:55:27 +01:00
Miriam Baglioni
01ce0b9c76
[doiboost - preprocess] remove transition to orcid preparation from sequence of steps at the beginning of the workflow
2023-12-15 12:24:55 +01:00
Miriam Baglioni
0d8e496a63
-
2023-12-15 12:16:43 +01:00
Claudio Atzori
c4ec35b6cd
Merge pull request 'Master branch updates from beta December 2023' ( #369 ) from beta_to_master_dicember2023 into master
Reviewed-on: D-Net/dnet-hadoop#369
2023-12-15 11:18:30 +01:00
Claudio Atzori
1726f49790
code formatting
2023-12-15 10:37:02 +01:00
Claudio Atzori
a59be5779e
Merge pull request '9078_xml_records_irish_tender' ( #368 ) from 9078_xml_records_irish_tender into beta
Reviewed-on: D-Net/dnet-hadoop#368
2023-12-12 12:34:43 +01:00
Claudio Atzori
ff924215b8
[graph provision] added tests for new peerreviewed field
2023-12-12 11:21:30 +01:00
Claudio Atzori
a6d635e695
Merge branch 'beta' into 9078_xml_records_irish_tender
2023-12-12 11:06:42 +01:00
Claudio Atzori
98cce5bfb2
code formatting
2023-12-12 09:59:05 +01:00
Claudio Atzori
84d54643cf
[cleaning] allow enriched orcids to pass the cleaning, rule out non-orcid author pids
2023-12-12 09:57:00 +01:00
Claudio Atzori
7e8eff40c1
[graph provision] added tests for the new model fields
2023-12-12 08:54:15 +01:00
Miriam Baglioni
8752d275fa
removed not needed parameter
2023-12-09 15:24:45 +01:00
Miriam Baglioni
d4eedada71
adjusting workflow definition
2023-12-09 15:20:11 +01:00
Claudio Atzori
aba95ed1d1
code formatting
2023-12-08 17:06:19 +01:00
Claudio Atzori
2877839df0
Merge pull request '[graph cleaning] added cleaning for result.publisher and result.instance.license' ( #366 ) from clean_license_publisher into beta
Reviewed-on: D-Net/dnet-hadoop#366
2023-12-08 16:58:37 +01:00
Claudio Atzori
34abd0fc43
Merge branch 'beta' into clean_license_publisher
2023-12-08 16:58:27 +01:00
Claudio Atzori
cb71a7936b
[graph cleaning] avoid stack overflow error when navigating Oaf objects declaring an Enum
2023-12-07 23:09:54 +01:00
Claudio Atzori
70eb1796b2
logging typo
2023-12-07 14:08:04 +01:00
Claudio Atzori
c381bacee0
[enrichment] passing the community API base URL
2023-12-07 14:07:11 +01:00
Miriam Baglioni
336fb31d87
[community_result_propagation] adjusting starting point of workflow
2023-12-07 10:27:25 +01:00
Miriam Baglioni
c0cde53bf6
[bulktagging] setting first step of bulktagging as the copy of the entities and relations not involved in the tagging
2023-12-07 10:08:35 +01:00
Miriam Baglioni
616622d2bb
first version of the workflow single step
2023-12-07 09:59:52 +01:00
Claudio Atzori
259c69e446
[orcid enrichment] fixed workflow definition
2023-12-06 19:41:53 +01:00