Claudio Atzori
a7a54aab47
WIP: align Solr JSON records to the explore portal requirements
2024-06-20 15:48:45 +02:00
Miriam Baglioni
eaa00a4199
[IrishFunderList] made changes according to ticket 9635, comments 20, 21, 22 and 23
2024-06-20 12:32:57 +02:00
Claudio Atzori
fb731b6d46
WIP: align Solr JSON records to the explore portal requirements
2024-06-19 15:38:43 +02:00
Miriam Baglioni
b6da35e736
[IrishFunderList] made changes according to ticket 9635, comments 14, 15 and 16
2024-06-19 11:06:58 +02:00
Lampros Smyrnaios
3c9b8de892
Miscellaneous updates to the copying operation to Impala Cluster:
...
- Fix not breaking out of the VIEWS-infinite-loop when the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" is set to "false".
- Exit the script when no HDFS-active-node was found, independently of the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR".
- Fix view_name-recognition in a log-message, by using the more advanced "Perl-Compatible Regular Expressions" in "grep".
- Add error-handling for "compute stats" errors.
2024-06-18 15:59:34 +02:00
Antonis Lempesis
c67ef157d3
filtering out deletedbyinference and invisible results from accessroute
2024-06-18 15:59:00 +02:00
Lampros Smyrnaios
c23f3031ed
Miscellaneous updates to the copying operation to Impala Cluster:
...
- Show some counts and the elapsed time for various sub-tasks.
- Code polishing.
2024-06-18 15:58:46 +02:00
Claudio Atzori
8ec151aa3d
[graph indexing] comment out setting the JSON payload from the SolrInputDocuments
2024-06-18 15:53:24 +02:00
Claudio Atzori
2636936162
[IE OAI-PMH] fixed oozie wf definition
2024-06-14 11:47:37 +02:00
Miriam Baglioni
ef437a8cdf
[Provision] temporarily removed JSON payload from indexed records (Shadow cannot support it)
2024-06-13 16:48:03 +02:00
Miriam Baglioni
86088ef26e
Merge remote-tracking branch 'origin/beta_to_master_may2024' into beta_to_master_may2024
2024-06-11 17:04:07 +02:00
Miriam Baglioni
143c525343
[WebCrawl]remove relations for pid not doi
2024-06-11 17:03:59 +02:00
Claudio Atzori
c371513d43
[graph resolution] use sparkExecutorMemory to define also the memoryOverhead
2024-06-11 14:21:01 +02:00
Miriam Baglioni
3efd5b1308
[SDGActionSet] remove datainfo for the result. It is not needed (qualifier.classid = UPDATE), since subjects do not go at the level of the instance
2024-06-11 10:35:57 +02:00
Miriam Baglioni
196fa55774
Merge remote-tracking branch 'origin/beta_to_master_may2024' into beta_to_master_may2024
2024-06-11 10:26:24 +02:00
Miriam Baglioni
50805e3fc1
[FoSActionSet] remove datainfo for the result. It is not needed (qualifier.classid = UPDATE), since subjects do not go at the level of the instance
2024-06-11 10:25:46 +02:00
Claudio Atzori
d39a1054b8
[actionset promotion] use sparkExecutorMemory to define also the memoryOverhead
2024-06-10 16:15:07 +02:00
Claudio Atzori
576efc1857
hostedby patching to work with the updated Crossref contents
2024-06-10 15:22:33 +02:00
Claudio Atzori
efc1632e16
code formatting
2024-06-06 09:25:26 +02:00
Claudio Atzori
91b49366c6
[graph provision] align serialisation of the usage count measures to the agreed specifications
2024-06-05 16:34:40 +02:00
Claudio Atzori
5e05385d35
minor
2024-06-05 16:31:58 +02:00
Miriam Baglioni
c4d9b5b9d2
[downloadsAndViews]update the test file to consider the new serialization for downloads and views
2024-06-05 16:30:15 +02:00
Miriam Baglioni
bf9a5e6314
[downloadsAndViews]changed the test file to check the indicators are not there if their value is 0
2024-06-05 16:29:40 +02:00
Miriam Baglioni
9d79ddb3dd
[bulkTag] fixed issue that made project disappear in graph_10_enriched
2024-06-05 16:20:40 +02:00
Miriam Baglioni
907aa28c6c
[downloadsAndViews] fixed issue
2024-06-05 16:19:29 +02:00
Miriam Baglioni
3955ceaa76
[downloadsAndViews] changed the serialization for downloads and views
2024-06-05 16:18:46 +02:00
Miriam Baglioni
128c143394
[downloadsAndViews] extended test file with measures for downloads and views
2024-06-05 16:17:59 +02:00
Claudio Atzori
5133993ee5
Merge branch 'beta_to_master_may2024' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta_to_master_may2024
2024-06-05 12:17:48 +02:00
Claudio Atzori
5cf259a851
[graph2hive] use sparkExecutorMemory to define also the memoryOverhead
2024-06-05 12:17:16 +02:00
Claudio Atzori
e1828fc60e
Merge pull request '[PROD] Irish oaipmh exporter' ( #444 ) from irish-oaipmh-exporter into beta_to_master_may2024
...
Reviewed-on: #444
2024-06-05 10:56:20 +02:00
Claudio Atzori
81090ad593
[IE OAI-PMH] added oozie workflow, minor changes, code formatting
2024-06-05 10:03:33 +02:00
Giambattista Bloisi
3feab5d92d
Fix MergeUtils.mergeGroup: it could get rid of some records and did not consider all PID authorities while sorting records.
...
ResultTypeComparator is now renamed to MergeEntitiesComparator and can be used as a general comparator for merging groups of records
2024-06-03 15:13:40 +02:00
Claudio Atzori
6be783caec
[graph cleaning] use sparkExecutorMemory to define also the memoryOverhead
2024-05-29 14:36:49 +02:00
Claudio Atzori
b703f94f09
Merge pull request 'changes in copy script - beta2master' ( #439 ) from antonis.lempesis/dnet-hadoop:beta into beta_to_master_may2024
...
Reviewed-on: #439
2024-05-29 14:29:26 +02:00
Miriam Baglioni
14f275ffaf
[NOAMI] removed Ireland funder id 501100011103. ticket 9635
2024-05-29 11:54:17 +02:00
Lampros Smyrnaios
e3f28338c1
Miscellaneous updates to the copying operation to Impala Cluster:
...
- Assign the WRITE and EXECUTE permissions to the DBs' HDFS-directories, in order to be able to create tables on top of them, in the Impala Cluster.
- Make sure the "copydb" function returns early, when it encounters a fatal error, while respecting the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" config.
2024-05-28 17:51:45 +03:00
Claudio Atzori
8e45c5baa8
graph cleaning to implement ugly hardcoded rules
2024-05-28 15:28:42 +02:00
Claudio Atzori
db5e18c784
hostedby patching to work with the updated Crossref contents
2024-05-28 15:28:13 +02:00
Claudio Atzori
fb266efbcb
[org dedup] avoid NPEs in SparkPrepareNewOrgs
2024-05-26 21:23:30 +02:00
Claudio Atzori
d7daf54333
[org dedup] avoid NPEs in SparkPrepareOrgRels
2024-05-26 16:48:11 +02:00
Claudio Atzori
f99eaa0376
Merge branch 'beta_to_master_may2024' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta_to_master_may2024
2024-05-26 15:45:41 +02:00
Claudio Atzori
23312fcc1e
[org dedup] avoid NPEs in SparkPrepareOrgRels
2024-05-26 15:43:24 +02:00
Miriam Baglioni
b864f0adcf
Update to include a blackList that filters out the results we know are wrongly associated to IE - update workflow definition - the blacklist parameter
2024-05-24 16:01:19 +02:00
Miriam Baglioni
7a44869d87
Update to include a blackList that filters out the results we know are wrongly associated to IE - refactoring
2024-05-24 15:23:42 +02:00
Miriam Baglioni
12ffde023f
Update to include a blackList that filters out the results we know are wrongly associated to IE
2024-05-24 12:28:24 +02:00
Antonis Lempesis
15b54a345a
added fos lvl4
2024-05-24 13:21:28 +03:00
Lampros Smyrnaios
b48ed6e617
Change configuration in the copy-operation to Impala Cluster:
...
Set the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" parameter to "false".
2024-05-23 16:58:12 +03:00
Lampros Smyrnaios
68322843e2
Small updates to the copy-operation to Impala Cluster:
...
- Add a configuration-"switch" to control whether the script exits upon an error or not.
- Allow the script to exit when a table could not be created.
- Show the elapsed time for processing each database.
2024-05-23 15:07:49 +03:00
Lampros Smyrnaios
c7b32bbacc
Update CopyDataToImpalaCluster:
...
Update the code of acquiring the entities from Ocean cluster, through hive, in order to optimize the process and account for additional reserved keywords in Impala.
Co-authored-by: Antonis Lempesis <antleb@di.uoa.gr>
2024-05-23 13:00:19 +03:00
Claudio Atzori
c3fe59bc78
fixed conflicts merging from beta, code formatting
2024-05-21 14:50:40 +02:00
Claudio Atzori
1ea67eba82
Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta
2024-05-21 13:48:48 +02:00
Claudio Atzori
834461ba26
[graph provision]fixed wf definition, revised serialization of the usage counts measures
2024-05-21 13:48:06 +02:00
Sandro La Bruzzo
032bcc8279
Since the last beta workflow we decided to introduce in the graph only MAG items with a DOI and to set them invisible (this should be the same behaviour as the previous DOIBoost mapping).
...
This commit applies this type of mapping
2024-05-20 09:24:15 +02:00
Claudio Atzori
92f018d196
[graph provision] fixed path pointing to an intermediate data store in the working directory
2024-05-15 15:39:18 +02:00
Claudio Atzori
0611c81a2f
[graph provision] using Qualifier.classNames to populate the corresponding fields in the JSON payload
2024-05-15 15:33:10 +02:00
Michele Artini
2b3b5fe9a1
oai finalization and test
2024-05-15 14:13:16 +02:00
Claudio Atzori
1efe7f7e39
[graph provision] upgrade to dhp-schema:6.1.2, included project.oamandatepublications in the JSON payload mapping, fixed serialisation of the usageCounts measures
2024-05-14 12:39:31 +02:00
Claudio Atzori
f7d56e2ef2
Merge branch 'beta' into rest-collector-plugin-with-retry
2024-05-10 09:02:21 +02:00
Claudio Atzori
dc3a5858f7
Merge branch 'beta' into beta_provision_relation
2024-05-09 14:14:43 +02:00
Claudio Atzori
55f39f7850
[graph provision] adds the possibility to validate the XML records before storing them via the validateXML parameter
2024-05-09 14:06:04 +02:00
Claudio Atzori
39a2afe8b5
[graph provision] fixed XML serialization of the usage counts measures, renamed workflow actions to better reflect their role
2024-05-09 13:54:42 +02:00
Claudio Atzori
908ed9da7a
Merge pull request 'Various fixes in the stats wf' ( #430 ) from antonis.lempesis/dnet-hadoop:beta into beta
...
Reviewed-on: #430
2024-05-08 13:41:02 +02:00
Antonis Lempesis
0cada3cc8f
every step is run in the analytics queue. Hardcoded for now, will make a parameter later
2024-05-08 13:42:53 +03:00
Antonis Lempesis
90a4fb3547
fixed typos
2024-05-08 13:17:58 +03:00
Claudio Atzori
18aa323ee9
cleanup unused classes, adjustments in the oozie wf definition
2024-05-08 11:36:46 +02:00
Michele Artini
c9a327bc50
refactoring of gzip method
2024-05-08 11:34:08 +02:00
Michele Artini
e234848af8
oaf record: xpath for root
2024-05-08 10:00:53 +02:00
Claudio Atzori
b4e3389432
fixed property mapping creating the RelatedEntity transient objects. spark cores & memory adjustments. Code formatting
2024-05-07 16:25:17 +02:00
Giambattista Bloisi
711048ceed
PrepareRelationsJob rewritten to use Spark Dataframe API and Windowing functions
2024-05-07 15:44:33 +02:00
Michele Artini
70bf6ac415
oai exporter tests
2024-05-07 09:36:26 +02:00
Michele Artini
aa40e53c19
oai exporter parameters
2024-05-07 08:01:19 +02:00
Michele Artini
ed052a3476
job for the population of the oai database
2024-05-06 16:08:33 +02:00
Claudio Atzori
26363060ed
fixed id prefix creation for the fosnodoi records, again
2024-05-03 15:53:52 +02:00
Claudio Atzori
0486227185
[cleaning] deactivating the cleaning of FOS subjects found in the metadata provided by repositories
2024-05-03 14:31:12 +02:00
Claudio Atzori
e1a0fb8933
fixed id prefix creation for the fosnodoi records
2024-05-03 14:14:18 +02:00
Claudio Atzori
4355f64810
reverted to version 1.2.5-SNAPSHOT
2024-05-02 11:23:53 +02:00
Claudio Atzori
66680b8b9a
refactoring of common utilities
2024-05-02 11:16:58 +02:00
Claudio Atzori
dcf23b3d06
Merge branch 'beta' into beta-release-1.2.5
2024-05-02 10:01:49 +02:00
Michele Artini
f4068de298
code reindent + tests
2024-05-02 09:51:33 +02:00
Claudio Atzori
11bd89e132
[enrichment] use sparkExecutorMemory to define also the memoryOverhead
2024-05-01 08:32:59 +02:00
Claudio Atzori
e96c2c1606
[ranking wf] set spark.executor.memoryOverhead to fine tune the resource consumption
2024-04-30 16:23:25 +02:00
Claudio Atzori
50c18f7a0b
[dedup wf] revised memory settings to address the increased volume of input contents
2024-04-30 12:34:16 +02:00
Michele Artini
2615136efc
added a retry mechanism
2024-04-30 11:58:42 +02:00
Claudio Atzori
e2937db385
Merge branch 'beta' into misc_fixes_merge_entities
2024-04-24 08:55:28 +02:00
Giambattista Bloisi
1878199dae
Miscellaneous fixes:
...
- in Merge By ID pick by preference those records coming from delegated Authorities
- fix various tests
- close spark session in SparkCreateSimRels
2024-04-24 08:12:45 +02:00
Lampros Smyrnaios
49af2e5740
Miscellaneous updates to the copying operation to Impala Cluster:
...
- Update the algorithm for creating views that depend on other views; overcome some bash-instabilities.
- Upon any error, fail the whole process, not just the current DB-creation, as those errors usually indicate a bug in the initial DB-creation, that should be fixed immediately.
- Enhance parallel-copy of large files by "hadoop distcp" command.
- Reduce the "invalidate metadata" commands to just the current DB's tables, in order to eliminate the general overhead on Impala.
- Show the number of tables and views in the logs.
- Fix some log-messages.
2024-04-23 17:15:04 +03:00
Antonis Lempesis
d2649a1429
increased the jvm ram
2024-04-23 16:03:16 +03:00
Claudio Atzori
c3053ef34d
using version 1.2.5-beta for the release
2024-04-23 14:52:32 +02:00
Claudio Atzori
b5bcab13ec
using version 1.2.5-beta for the release
2024-04-23 14:36:39 +02:00
Claudio Atzori
425c9afc36
using version 1.2.5-beta for the release
2024-04-23 14:30:04 +02:00
Claudio Atzori
93dd9cc639
code formatting
2024-04-23 11:28:00 +02:00
Miriam Baglioni
6189879643
[NOAMI] removed entry for Irish Research eLibrary (IReL) Care Board from the list of funders.
2024-04-23 11:09:18 +02:00
Miriam Baglioni
7de114bda0
[WebCrawl] addressing comments from PR
2024-04-22 13:52:50 +02:00
Miriam Baglioni
776c898c4b
[WebCrawl] adding affiliation relations from web information
2024-04-22 11:04:17 +02:00
Claudio Atzori
0656ab2838
code formatting
2024-04-20 08:10:58 +02:00
Claudio Atzori
ab7f0855af
fixed query reading projects from the aggregator DB
2024-04-20 08:10:32 +02:00
Claudio Atzori
e5879b68c7
[transformative agreement] including result-funder relations to the information imported from the TRs
2024-04-19 17:14:18 +02:00
Claudio Atzori
3a027e97a7
[graph indexing] sets spark memoryOverhead in the join operations to the same value used for the executor memory
2024-04-19 16:59:58 +02:00
Claudio Atzori
0c05abe50b
[graph indexing] sets spark memoryOverhead in the join operations to the same value used for the executor memory
2024-04-19 16:57:55 +02:00
Sandro La Bruzzo
b72c3139e2
updated Ignore annotation that is deprecated to Disabled
2024-04-19 14:52:40 +02:00