dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Michele Artini	cb29b9773c	xslt rules	2024-03-18 15:31:34 +01:00
Michele Artini	85b844d57e	updated BASE filter param	2024-03-15 15:03:27 +01:00
Michele Artini	455f2e1e07	apply commits from master	2024-03-15 14:56:39 +01:00
Michele Artini	30167aa882	mapped oaf:country from results	2024-03-15 11:24:16 +01:00
Michele Artini	88fef367b9	new plugin to collect from a dump of BASE	2024-03-15 10:47:52 +01:00
Claudio Atzori	078169b922	cleanup	2024-03-15 09:56:04 +01:00
Claudio Atzori	af154d4456	implemented changes from #9497: sort abstracts by string length, included author full names in the related results, expanded instance details within each children/result XML element	2024-03-14 16:21:23 +01:00
Claudio Atzori	7863c92466	expanded paper abstract in the result/children XML element (ticket #9497)	2024-03-13 16:25:31 +01:00
Claudio Atzori	eb5887cb9a	including related organization url in the XML record serialization (ticket #9498)	2024-03-13 14:46:00 +01:00
Miriam Baglioni	5a32bb9578	[OC New] last fix	2024-03-13 09:36:18 +01:00
Miriam Baglioni	48c052215c	[OC New] last fix	2024-03-12 23:12:32 +01:00
Claudio Atzori	db66555ebb	WIP: updated provision workflow to create a JSON based representation of the payload	2024-03-12 09:56:09 +01:00
Antonis Lempesis	f74c7e8689	selecting distinct peer_reviewed	2024-03-12 02:13:04 +02:00
Giambattista Bloisi	9092075760	Enrich authors with ORCID info using new matching algorithm	2024-03-11 13:23:59 +01:00
Claudio Atzori	d4871b31e8	WIP: extended provision workflow to create the JSON based payload	2024-03-08 11:43:20 +01:00
Antonis Lempesis	3c79720342	fixed the irish result subset	2024-03-07 14:08:57 +02:00
Antonis Lempesis	5ae4b4286c	Merge branch 'beta' of https://code-repo.d3science.org/antonis.lempesis/dnet-hadoop into beta	2024-03-07 12:15:19 +02:00
Miriam Baglioni	5180b6ec8a	[FOSNEW] removed test class	2024-03-07 10:47:13 +01:00
Miriam Baglioni	7827a2d66b	[OCNEW] added creation of the actionset for the results classified with FoS based on the OpenAIRE identifier	2024-03-07 10:36:30 +01:00
Antonis Lempesis	316d585c8a	using distinct apcs per publication to avoid huge sums	2024-03-07 02:07:59 +02:00
Miriam Baglioni	fd34372c40	[OCNEW] first implementation	2024-03-06 13:42:00 +01:00
Giambattista Bloisi	3cd5590f3b	When converting json to XML, remove characters that are not allowed in the XML 1.0 specs, as they will cause xpath failures even if escaped	2024-02-28 15:14:18 +01:00
Giambattista Bloisi	56dd05f85c	Merge pull request 'Revised procedure when converting json data into xml' (#395) from restiterator_xmlcleanup into beta. Reviewed-on: D-Net/dnet-hadoop#395	2024-02-28 10:38:54 +01:00
Claudio Atzori	6fcf872daa	Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into index_records	2024-02-28 10:27:28 +01:00
Claudio Atzori	3f07390a58	WIP	2024-02-28 10:10:10 +01:00
Sandro La Bruzzo	7d806a434c	formatted code	2024-02-28 09:31:58 +01:00
Sandro La Bruzzo	b63994dcc4	Merge remote-tracking branch 'origin/beta' into orcid_update	2024-02-28 09:11:18 +01:00
Sandro La Bruzzo	915a76a796	following the comment on the pull requests: - Added #NUM_OF_THREADS complete job in the queue at the end of the main loop to avoid deadlock	2024-02-28 09:10:55 +01:00
Giambattista Bloisi	773e856550	Revised procedure when converting json data into xml: - json object keys are renamed to be conformant to xml tag elements, special characters are substituted or removed - json string values are no longer post-processed as they are already escaped by the org.json.XML.toString method	2024-02-24 16:54:30 +01:00
Sandro La Bruzzo	a712df1e1d	Merge remote-tracking branch 'origin/beta' into orcid_update	2024-02-23 10:12:25 +01:00
Sandro La Bruzzo	b32a9d1994	Implemented workflow for updating table, added step to check if the newly generated table is valid	2024-02-23 10:04:28 +01:00
Michele Artini	3268570b2c	mapping of project PIDs	2024-02-22 14:47:21 +01:00
Miriam Baglioni	72bae7af76	[Transformative Agreement] removed the relations from the ActionSet waiting to have the green light from Ioanna	2024-02-19 16:20:12 +01:00
Miriam Baglioni	43da7e1191	[Tagging Projects and Datasource] changed the way the pathMap parameter is passed. It was too long and was truncated	2024-02-19 16:12:59 +01:00
Serafeim Chatzopoulos	f0dc12634b	Add Action Set creation for affiliations inferred from the OpenAPC data	2024-02-18 18:02:09 +02:00
Claudio Atzori	a63b091bae	Merge branch 'beta' into import_orps_fix	2024-02-15 15:01:56 +01:00
Miriam Baglioni	8dae10b442	-	2024-02-14 14:57:08 +01:00
Miriam Baglioni	83bb97be83	[Tagging Projects and Datasource] added test to check datasource tagging. Fixed issue	2024-02-14 11:23:47 +01:00
Miriam Baglioni	6e1f383e4a	[Tagging Projects and Datasource] first extension of bulktagging to add the context to projects and datasource	2024-02-13 16:37:14 +01:00
Miriam Baglioni	3f7d262a4e	merging with branch beta	2024-02-13 14:05:58 +01:00
Miriam Baglioni	eca021f4d6	[Transformative Agreement] add results with information about the agreement and the country of the organization that paid for it	2024-02-13 12:21:07 +01:00
Miriam Baglioni	bdb6bbb365	merging with branch beta	2024-02-12 15:50:43 +01:00
Claudio Atzori	d85d2df6ad	[graph raw] fixed mapping of the original resource type from the Datacite format	2024-02-09 10:20:20 +01:00
Giambattista Bloisi	b19643f6eb	Dedup aliases, created when a dedup in a previous build has been merged in a new dedup, need to be marked as "deletedbyinference", since they are "merged" in the new dedup	2024-02-08 15:34:59 +01:00
Antonis Lempesis	dd4c27f4f3	added 2 new institutions in monitor	2024-02-08 12:57:57 +02:00
Claudio Atzori	38c9001147	fixed import of ORPs stored on HDFS in the internal graph format (e.g. Datacite)	2024-02-07 17:02:05 +01:00
Claudio Atzori	fd17c1f17c	[actionsets] fixed join type	2024-02-05 16:55:36 +02:00
Claudio Atzori	009dcf6aea	[actionsets] introduced support for the PromoteAction strategy	2024-02-05 16:43:40 +02:00
Claudio Atzori	42f5506306	[orcid enrichment] fixed directory cleanup before distcp	2024-02-05 09:45:36 +02:00
Alessia Bardi	f2a08d8cc2	test for Italian records from IRS repositories	2024-01-30 19:20:14 +01:00
Antonis Lempesis	a512ead447	changed orcid ids to all capital	2024-01-30 16:54:47 +02:00
Miriam Baglioni	07a373a0bd	[bulkTagging] removing checks while performing the substring action so that it will fire an Exception if the parameters are wrongly set	2024-01-30 13:51:11 +01:00
Miriam Baglioni	ead08b0dd4	merging with branch beta	2024-01-30 12:19:10 +01:00
Antonis Lempesis	bb10a22290	merged changes from dnet-hadoop	2024-01-29 21:51:47 +02:00
Miriam Baglioni	a5995ab557	[orcid-enrichment] change the value of parameters.	2024-01-29 18:19:48 +01:00
Miriam Baglioni	a418dacb47	[UsageCount] code extension to include also the name of the datasource	2024-01-29 18:12:33 +01:00
Miriam Baglioni	e9131f4e4a	merging with branch beta	2024-01-29 16:27:18 +01:00
Sandro La Bruzzo	9aebca77a0	Added exception throwing in Hadoop transformation when TR is not syntactically valid	2024-01-29 14:41:02 +01:00
Claudio Atzori	926903b06b	Merge branch 'beta' into stats_with_spark_sql	2024-01-29 09:11:45 +01:00
Giambattista Bloisi	078df0b4d1	Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf	2024-01-26 21:56:55 +01:00
Claudio Atzori	ce3200263e	Merge branch 'beta' into crossref_missing_author_fix	2024-01-26 15:57:04 +01:00
Sandro La Bruzzo	e889808daa	Fixed problem on missing author in crossref Mapping	2024-01-26 12:19:04 +01:00
Antonis Lempesis	c548796463	Changed step16-createIndicatorsTables to use a spark oozie action instead of hive	2024-01-26 02:04:48 +02:00
Sandro La Bruzzo	0386f36385	Added workflow to update ORCID and replaced some parsing, because the update works and employments xml differs from the dump one.	2024-01-25 19:40:59 +01:00
Antonis Lempesis	a7115cfa9e	max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%.	2024-01-25 15:13:16 +01:00
Antonis Lempesis	fd43b0e84a	max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%.	2024-01-25 15:06:34 +01:00
Claudio Atzori	9b13c22e5d	[graph provision] retrieve all the context information by adding all=true to the requests issued to the API	2024-01-23 15:36:08 +01:00
Sandro La Bruzzo	43e0bba7ed	log added during download	2024-01-23 15:04:49 +01:00
Miriam Baglioni	f7d06dc661	compilation after merging	2024-01-23 11:43:08 +01:00
Miriam Baglioni	6e58d79623	merging with branch beta	2024-01-23 11:36:47 +01:00
Miriam Baglioni	e0ec800d7e	[BulkTagging] extend the definition of the pathMap to include also actions that should be performed on the value extracted from the result before applying the constraint	2024-01-23 11:34:53 +01:00
Claudio Atzori	f87f3a6483	[graph provision] updated param specification for the XML converter job	2024-01-23 08:54:37 +01:00
Claudio Atzori	6fd25cf549	code formatting	2024-01-23 08:47:12 +01:00
Claudio Atzori	f76852f385	Merge branch 'beta' into update_pivots_table	2024-01-22 16:37:22 +01:00
Claudio Atzori	1c6db320f4	[graph provision] obtain context info from the context API instead from the ISLookUp service	2024-01-22 15:53:17 +01:00
Claudio Atzori	2655eea5bc	[orcid enrichment] drop paths before copying the non-modified contents	2024-01-19 16:28:05 +01:00
Claudio Atzori	c6b3401596	increased shuffle partitions for publications in the country propagation workflow	2024-01-19 10:15:39 +01:00
Miriam Baglioni	bcc0a13981	[enrichment single step] adding <end> element in wf definition	2024-01-18 17:39:14 +01:00
Miriam Baglioni	6af536541d	[enrichment single step] moving parameter file in correct location	2024-01-18 15:35:40 +01:00
Miriam Baglioni	a12a3eb143	-	2024-01-18 15:18:10 +01:00
Miriam Baglioni	82e9e262ee	[enrichment single step] remove parameter from execution	2024-01-17 17:38:03 +01:00
Miriam Baglioni	67ce2d54be	[enrichment single step] refactoring to fix issues in disappeared result type	2024-01-17 16:50:00 +01:00
Miriam Baglioni	59eaccbd87	[enrichment single step] refactoring to fix issue in disappeared result type	2024-01-15 17:49:54 +01:00
Giambattista Bloisi	21a14fcd80	Reusable RunSQLSparkJob for executing SQL in Spark through Oozie Spark Actions; implements pivots table update oozie workflow	2024-01-15 10:18:14 +01:00
Sandro La Bruzzo	e0753f19da	Fixed error of connection timeout	2024-01-13 09:27:08 +01:00
sandro.labruzzo	e328bc0ade	fixed missing parameter on download update	2024-01-12 16:18:20 +01:00
Miriam Baglioni	f612125939	fix issue on FoS integration. Removing the null values from FoS	2024-01-12 10:20:28 +01:00
Claudio Atzori	cb9e739484	Merge branch 'beta' into resource_types	2024-01-11 16:29:41 +01:00
Claudio Atzori	2753044d13	refined mapping for the extraction of the original resource type	2024-01-11 16:28:26 +01:00
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots; do not create dedup records for groups that have more than 20 different acceptance dates	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	10e135db1e	Use dedup_wf_002 in place of dedup_wf_001 to make explicit a different algorithm has been used to generate those kind of ids	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	831cc1fdde	Generate "merged" dedup id relations also for records that are filtered out by the cut parameters	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	1287315ffb	No longer use dedupId information from pivotHistory Database	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: - Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results - Clean titles once before clustering and similarity comparisons - Added support for filtered fields in model - Added support for sorting List fields in model - Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions - Added new maxLengthMatch comparator function - Use reduced complexity Levenshtein with threshold in levensteinTitle - Use reduced complexity AuthorsMatch with threshold early-quit - Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor - Use new clusterings configuration in Dedup tests. SparkWhitelistSimRels: use left semi join for clarity and performance. SparkCreateMergeRels: - Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX - Refactored to use Windowing sorting rather than groupBy to reduce memory pressure - Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past - Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01" - Changed generation of ids of type dedup_wf_001 to avoid collisions. DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Antonis Lempesis	e024718f73	creating result_instances even when no pids exist for the instance	2024-01-10 22:25:50 +01:00
Sandro La Bruzzo	859babf722	added some useful comment	2024-01-10 19:51:13 +01:00
Sandro La Bruzzo	39ebb60b38	Merge remote-tracking branch 'origin/beta' into orcid_update	2024-01-10 19:50:00 +01:00
Sandro La Bruzzo	9d5a7c3b22	code refactor	2024-01-10 19:42:34 +01:00
Sandro La Bruzzo	8f61063201	Added workflow	2024-01-10 19:42:22 +01:00
Sandro La Bruzzo	1a42a5c10d	Implemented Download update of ORCID	2024-01-10 18:03:20 +01:00

4210 Commits