dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Antonis Lempesis	5ae4b4286c	Merge branch 'beta' of https://code-repo.d3science.org/antonis.lempesis/dnet-hadoop into beta	2024-03-07 12:15:19 +02:00
Antonis Lempesis	316d585c8a	using distinct apcs per publication to avoid huge sums	2024-03-07 02:07:59 +02:00
Antonis Lempesis	dd4c27f4f3	added 2 new institutions in monitor	2024-02-08 12:57:57 +02:00
Antonis Lempesis	a512ead447	changed orcid ids to all capital	2024-01-30 16:54:47 +02:00
Antonis Lempesis	bb10a22290	merged changes from dnet-hadoop	2024-01-29 21:51:47 +02:00
Claudio Atzori	f804c58bc7	Merge pull request 'Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf' (#386) from stats_with_spark_sql into beta Reviewed-on: D-Net/dnet-hadoop#386	2024-01-29 09:11:59 +01:00
Claudio Atzori	926903b06b	Merge branch 'beta' into stats_with_spark_sql	2024-01-29 09:11:45 +01:00
Giambattista Bloisi	078df0b4d1	Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf	2024-01-26 21:56:55 +01:00
Claudio Atzori	bf99c424fa	Merge pull request 'Fixed problem on missing author in crossref Mapping' (#383) from crossref_missing_author_fix into beta Reviewed-on: D-Net/dnet-hadoop#383	2024-01-26 15:57:23 +01:00
Claudio Atzori	ce3200263e	Merge branch 'beta' into crossref_missing_author_fix	2024-01-26 15:57:04 +01:00
Sandro La Bruzzo	e889808daa	Fixed problem on missing author in crossref Mapping	2024-01-26 12:19:04 +01:00
Claudio Atzori	9e8fc6aa88	[collection] increased logging from the oai-pmh metadata collection process	2024-01-26 09:17:20 +01:00
Antonis Lempesis	c548796463	Changed step16-createIndicatorsTables to use a spark oozie action instead of hive	2024-01-26 02:04:48 +02:00
Antonis Lempesis	a7115cfa9e	max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%.	2024-01-25 15:13:16 +01:00
Antonis Lempesis	fd43b0e84a	max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%.	2024-01-25 15:06:34 +01:00
Claudio Atzori	2838a9b630	Update 'CONTRIBUTING.md'	2024-01-24 16:07:05 +01:00
Claudio Atzori	da944a5c55	Merge pull request 'code of conduct and contributing' (#382) from contributing into beta Reviewed-on: D-Net/dnet-hadoop#382	2024-01-24 15:40:26 +01:00
Claudio Atzori	0c97a3a81a	minor	2024-01-24 10:56:33 +01:00
Claudio Atzori	2c1e6849f0	added code of conduct and contributing files	2024-01-24 10:36:41 +01:00
Claudio Atzori	9b13c22e5d	[graph provision] retrieve all the context information by adding all=true to the requests issued to the API	2024-01-23 15:36:08 +01:00
Claudio Atzori	3e96777cc4	[collection] increased logging from the oai-pmh metadata collection process	2024-01-23 15:21:03 +01:00
Claudio Atzori	9812406589	Merge pull request '[graph provision] updated param specification for the XML converter job' (#380) from provision_community_api into beta Reviewed-on: D-Net/dnet-hadoop#380	2024-01-23 08:55:59 +01:00
Claudio Atzori	f87f3a6483	[graph provision] updated param specification for the XML converter job	2024-01-23 08:54:37 +01:00
Claudio Atzori	6fd25cf549	code formatting	2024-01-23 08:47:12 +01:00
Claudio Atzori	bd187ec6e7	Merge pull request 'Implements pivots table update oozie workflow' (#376) from update_pivots_table into beta Reviewed-on: D-Net/dnet-hadoop#376	2024-01-22 16:37:30 +01:00
Claudio Atzori	f76852f385	Merge branch 'beta' into update_pivots_table	2024-01-22 16:37:22 +01:00
Claudio Atzori	b9fcc5ad5e	Merge pull request 'Context API update' (#379) from provision_community_api into beta Reviewed-on: D-Net/dnet-hadoop#379	2024-01-22 15:55:33 +01:00
Claudio Atzori	1c6db320f4	[graph provision] obtain context info from the context API instead from the ISLookUp service	2024-01-22 15:53:17 +01:00
Claudio Atzori	2655eea5bc	[orcid enrichment] drop paths before copying the non-modified contents	2024-01-19 16:28:05 +01:00
Claudio Atzori	c6b3401596	increased shuffle partitions for publications in the country propagation workflow	2024-01-19 10:15:39 +01:00
Miriam Baglioni	bcc0a13981	[enrichment single step] adding <end> element in wf definition	2024-01-18 17:39:14 +01:00
Miriam Baglioni	6af536541d	[enrichment single step] moving parameter file in correct location	2024-01-18 15:35:40 +01:00
Miriam Baglioni	a12a3eb143	-	2024-01-18 15:18:10 +01:00
Claudio Atzori	628fdfb5eb	Merge pull request '[enrichment single step]' (#378) from enrichmentSingleStepFixed into beta Reviewed-on: D-Net/dnet-hadoop#378	2024-01-18 09:41:09 +01:00
Miriam Baglioni	82e9e262ee	[enrichment single step] remove parameter from execution	2024-01-17 17:38:03 +01:00
Miriam Baglioni	67ce2d54be	[enrichment single step] refactoring to fix issues in disappeared result type	2024-01-17 16:50:00 +01:00
Miriam Baglioni	59eaccbd87	[enrichment single step] refactoring to fix issue in disappeared result type	2024-01-15 17:49:54 +01:00
Giambattista Bloisi	21a14fcd80	Reusable RunSQLSparkJob for executing SQL in Spark through Oozie Spark Actions Implements pivots table update oozie workflow	2024-01-15 10:18:14 +01:00
Claudio Atzori	2d302e6827	Merge pull request '[FoS integration] fix issue on FoS integration. Removing the null values from FoS' (#375) from fosPreparationBeta into beta Reviewed-on: D-Net/dnet-hadoop#375	2024-01-12 10:27:28 +01:00
Miriam Baglioni	f612125939	fix issue on FoS integration. Removing the null values from FoS	2024-01-12 10:20:28 +01:00
Claudio Atzori	c67467723b	Merge pull request 'refined mapping for the extraction of the original resource type' (#374) from resource_types into beta Reviewed-on: D-Net/dnet-hadoop#374	2024-01-11 16:29:47 +01:00
Claudio Atzori	cb9e739484	Merge branch 'beta' into resource_types	2024-01-11 16:29:41 +01:00
Claudio Atzori	2753044d13	refined mapping for the extraction of the original resource type	2024-01-11 16:28:26 +01:00
Giambattista Bloisi	a88dce5bf3	Merge pull request 'Improvements and refactoring in Dedup' (#367) from dedup_increasenumofblocks into beta Reviewed-on: D-Net/dnet-hadoop#367	2024-01-11 11:24:06 +01:00
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots Do not create dedup records for group that have more than 20 different acceptance date	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	10e135db1e	Use dedup_wf_002 in place of dedup_wf_001 to make explicit a different algorithm has been used to generate those kind of ids	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	831cc1fdde	Generate "merged" dedup id relations also for records that are filtered out by the cut parameters	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	1287315ffb	Do no longer use dedupId information from pivotHistory Database	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: - Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results - Clean titles once before clustering and similarity comparisons - Added support for filtered fields in model - Added support for sorting List fields in model - Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions - Added new maxLengthMatch comparator function - Use reduced complexity Levenshtein with threshold in levensteinTitle - Use reduced complexity AuthorsMatch with threshold early-quit - Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor - Use new clusterings configuration in Dedup tests SparkWhitelistSimRels: use left semi join for clarity and performance SparkCreateMergeRels: - Use new connected component algorithm that converge faster than Spark GraphX provided algorithm - Refactored to use Windowing sorting rather than groupBy to reduce memory pressure - Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past - Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01" - Changed generation of ids of type dedup_wf_001 to avoid collisions DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Antonis Lempesis	e024718f73	creating result_instances even when no pids exist for the instance	2024-01-10 22:25:50 +01:00

4906 Commits
