Commit Graph

303 Commits

Author SHA1 Message Date
Alessia Bardi e53228401b style 2021-12-09 15:46:22 +01:00
Alessia Bardi 6b5d7688a4 #7275 serialize license information in XML records 2021-12-09 13:46:48 +01:00
Claudio Atzori 9cac283bec implemented Instance serialization features requested in https://support.openaire.eu/issues/7156 2021-12-02 17:20:33 +01:00
Claudio Atzori 1de881b796 resolved conflicts for #165 2021-11-26 16:15:11 +01:00
Sandro La Bruzzo c9870c5122 code formatted 2021-10-19 15:24:59 +02:00
Claudio Atzori e471f12d5e hotfix: recovered implementation removing the hardcoded working_dirs 2021-10-19 12:35:38 +02:00
Claudio Atzori 14fbf92ad6 Merge branch 'beta' into beta_solr_config 2021-10-14 11:08:44 +02:00
Sandro La Bruzzo 5606014b17 code refactor see ticket #7065 2021-10-12 08:11:53 +02:00
Claudio Atzori 2f61054cd1 code formatting 2021-10-11 18:29:42 +02:00
Serafeim Chatzopoulos 201ce71cc1 Add resultsubject, relprojectname and resultacceptanceyear to __all field 2021-10-11 13:16:39 +03:00
Serafeim Chatzopoulos e468a7b96b Add tests to query Solr with different configurations 2021-10-08 16:58:51 +03:00
Serafeim Chatzopoulos de81007302 Add exploreTestConfig, a new Solr configuration folder 2021-10-08 16:54:56 +03:00
Alessia Bardi 8d3b60f446 test for patching records for EOSC Future 2021-10-07 17:30:45 +02:00
Alessia Bardi b924276e18 tests to generate records for the EOSC-Future demo with the EOSC Jupyter Notebook subject 2021-09-24 17:11:56 +02:00
Sandro La Bruzzo d4dadf6d77 reduced max number of PID in Relatedentity 2021-09-02 14:21:24 +02:00
Sandro La Bruzzo 9f8a80deb7 fixed wrong import of unresolved relation in openaire 2021-09-01 14:16:27 +02:00
Alessia Bardi 3762b17f7b added VERSIOn and PART relationship and re-ordered according to my personal and obviously possibly biased ordering 2021-08-31 20:20:05 +02:00
Alessia Bardi 931f430129 Merge branch 'beta' into datasource_model_eosc_beta 2021-08-23 11:57:21 +02:00
Claudio Atzori 9f4db73f30 updated/fixed unit tests 2021-08-11 15:02:51 +02:00
Claudio Atzori 2ee21da43b suggestions from SonarLint 2021-08-11 12:13:22 +02:00
Sandro La Bruzzo 6358f92c3a added sleep to solve problem of lost request of creating index 2021-07-30 08:54:37 +02:00
Claudio Atzori c53d106e80 [provision] lowercase relation filter 2021-07-29 13:57:00 +02:00
Sandro La Bruzzo 3721df7aa6 refactoring create actionset of scholexplorer, moved on package dhp-aggregation 2021-07-29 10:45:35 +02:00
Sandro La Bruzzo 3d8f0f629b implemented workflow of creation action set for scholexplorer 2021-07-28 16:15:34 +02:00
Alessia Bardi df8715a1ec format code after mvn compile 2021-07-28 11:58:26 +02:00
Michele Artini 3e2a2d6e71 added new fields in xml 2021-07-28 11:56:55 +02:00
Alessia Bardi c806387d4b tests for enermaps 2021-07-28 11:54:36 +02:00
Claudio Atzori 2fff24df55 code formatting 2021-07-28 11:34:19 +02:00
Sandro La Bruzzo 16c91203bd implemented workflow of creation action set for scholexplorer 2021-07-28 10:30:49 +02:00
Michele Artini 52e2315ba2 removed trick for datasourcetypeui 2021-07-28 10:23:00 +02:00
Claudio Atzori 10d7b4f0b4 filtering 'old' OpenAIRE ids from the entity.originalId[] array in the OAF -> XML serialization procedure 2021-07-20 11:52:05 +02:00
Sandro La Bruzzo bbe8193930 merged stable ids 2021-07-12 17:00:43 +02:00
Sandro La Bruzzo 57c74c73c6 fixed mistakes in oozie workflow 2021-07-09 12:28:09 +02:00
Sandro La Bruzzo 61ccb54fde removed wrong loop on oozie wf 2021-07-09 12:17:57 +02:00
Sandro La Bruzzo 9f5a0f3ab6 moved wf indexing of Scholexplorer in dhp-graph-provision 2021-07-09 12:06:43 +02:00
Claudio Atzori 96238152cb added serialization for alternateIdentifiers and pids within each record instance 2021-05-28 16:57:30 +02:00
Claudio Atzori 23b8883ab1 applied intellij code cleanup 2021-05-14 10:58:12 +02:00
Claudio Atzori 609eb711b3 IndexRecordTransformerTest for producing a record that can be manually submitted to solr 2021-05-13 16:13:28 +02:00
Claudio Atzori 1517bf7c92 IndexRecordTransformerTest for producing a record that can be manually submitted to solr 2021-05-13 16:11:22 +02:00
Claudio Atzori 5afa7d3e0c core utilities in dhp-common moved in external module dhp-schemas 2021-04-27 15:44:01 +02:00
Claudio Atzori 27ab8a704d adjusted poms to align with the external dhp-schema module 2021-04-27 10:12:27 +02:00
Claudio Atzori c2bb03c8b5 depending on external dhp-schemas module 2021-04-23 17:57:35 +02:00
Claudio Atzori 1e7e5180fa [Graph model] updated definition of ExternalReference: added alternateLabel, removed description (#6503) 2021-04-02 12:32:12 +02:00
Claudio Atzori 7941d7be29 WIP: using common definitions from ModelConstants 2021-03-31 18:33:57 +02:00
Claudio Atzori 72ce741ea6 WIP: using common definitions from ModelConstants 2021-03-31 17:07:13 +02:00
Sandro La Bruzzo c73072079d fix conflicts 2021-03-22 16:36:31 +01:00
Claudio Atzori 8d2bb24512 merged from master 2021-03-08 15:44:34 +01:00
Alessia Bardi 32e81c2d89 non validated rel has null value in validated field 2021-02-16 11:01:42 +01:00
Claudio Atzori 29c6f7e255 classes related to the collection workflow moved into common package; implemented MongoDB collection plugins 2021-02-12 12:31:02 +01:00
Claudio Atzori b34b5a39ca index field authoridtypevalue mixes up different author id-type value pairs, dropped in favour of orcidtypevalue 2021-02-11 09:36:04 +01:00
Alessia Bardi 986dd969d3 use the proper import for Lists 2021-02-10 12:03:54 +01:00
Alessia Bardi 09fc7e2f78 serialization of validated flag on relationships 2021-02-10 11:22:09 +01:00
Claudio Atzori 82e6c50f3f updated solr fields (authoridtypevalue, resultsubject, resultresourcetypename) 2021-02-09 16:27:04 +01:00
Claudio Atzori 62bd3c53ee Merge branch 'master' into provision_indexing 2021-02-09 15:46:26 +01:00
Claudio Atzori 72c57b28fa switched project version to 1.2.4-branch_hadoop_aggregator-SNAPSHOT 2021-02-04 14:08:18 +01:00
Claudio Atzori b6f08ce226 re-adding the old junit:junit dep as solr-test-framework needs it 2020-12-14 15:07:31 +01:00
Claudio Atzori 1506f49052 Xml record serialization for author PIDs: 1) only one value per PID type is allowed; 2) orcid prevails over orcid_pending 2020-12-14 11:14:03 +01:00
Claudio Atzori 61cd129ded XML serialisation test 2020-12-11 12:44:53 +01:00
Claudio Atzori ce7a319e01 using the correct assertion import 2020-12-11 12:44:17 +01:00
Claudio Atzori 7fe2433137 excluded transitive older junit dependencies, they can compromise the unit test executions 2020-12-11 12:42:55 +01:00
Claudio Atzori d9532446eb imported more diffs from master branch; code formatting 2020-12-10 16:14:16 +01:00
Claudio Atzori 12e2f930c8 resolved conflicts 2020-12-10 10:57:39 +01:00
Claudio Atzori ff72fcd91a allow orcid_pending to be percolate to the XML graph serialization 2020-12-09 19:04:50 +01:00
Claudio Atzori 211aa04726 allow orcid_pending to be percolate to the XML graph serialization 2020-12-09 18:08:51 +01:00
Claudio Atzori 026ad40633 disabled test 2020-12-07 13:50:01 +01:00
Claudio Atzori cfb55effd9 code formatting 2020-12-02 11:23:49 +01:00
Alessia Bardi 2d15667b4a testing XML generation from json object (case AMS ACTA) 2020-12-02 10:16:26 +01:00
Claudio Atzori d48f388fb2 Merge branch 'provision_indexing' 2020-11-19 15:59:55 +01:00
Claudio Atzori 7c9feaf9e7 project attributes removed from the XML record serialization: contactfullname, contactfax, contactphone, contactemail 2020-11-19 15:26:20 +01:00
Claudio Atzori 3f34757c63 merged from master 2020-11-19 14:34:54 +01:00
Claudio Atzori 0374d34c3e introduced configuration param outputFormat: HDFS | SOLR 2020-11-19 10:34:28 +01:00
Claudio Atzori 5218718e8b updated set of fields from the MDFormatDSResourceType on PROD 2020-11-18 15:00:41 +01:00
Claudio Atzori d9e07a242b extended XmlIndexingJob to accept an optional parameter: outputPath. When present, forces the job to write its output on the specified HDFS location 2020-11-18 14:34:55 +01:00
Claudio Atzori 29dcff0f34 spark complains about missing classes, so here they are again 2020-11-18 14:32:32 +01:00
Claudio Atzori 8177ce7939 test for XmlIndexingJob based on a local miniSolrCluster 2020-11-18 10:58:05 +01:00
Claudio Atzori 2bed29eb09 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:12 +01:00
Claudio Atzori 9b0fb9e958 merged from master 2020-11-12 09:27:12 +01:00
Claudio Atzori 822971f54f no need to filter relations in CreateRelatedEntitiesJob_phase1; replaced 'left outer' join with 'left' join in CreateRelatedEntitiesJob_phase2; cleanup; 2020-11-12 09:22:59 +01:00
Claudio Atzori 18d9aad70c improved documentation in dhp-graph-provision 2020-11-10 11:48:55 +01:00
Claudio Atzori 58f28296ea ProvisionConstants moved as ModelHardLimits in dhp-common and applied to truncate long abstracts (len > 150000). Further filtering for empty PID values 2020-10-30 10:56:42 +01:00
Claudio Atzori 1871d1c6f6 solve error java.lang.NoSuchFieldError: INSTANCE when instantiating Solr client 2020-08-14 11:18:30 +02:00
Claudio Atzori 3a11a387a9 data provision workflow enhancement: added nodes to perform DELETE BY QUERY before the indexing begins and COMMIT after the indexing is completed 2020-08-03 14:28:08 +02:00
Claudio Atzori cc5d13da85 introduced parameter shouldIndex (true|false) 2020-07-16 13:46:39 +02:00
Claudio Atzori b098cc3cbe avoid repeating identical values for fields: source, description 2020-07-16 13:45:53 +02:00
Claudio Atzori 7d6e269b40 reverted CreateRelatedEntitiesJob_phase1 to its previous state 2020-07-13 22:54:04 +02:00
Claudio Atzori 8e97598eb4 avoid to NPE in case of null instances 2020-07-13 20:46:14 +02:00
Claudio Atzori 06c1913062 added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations 2020-07-10 19:03:33 +02:00
Claudio Atzori 4c3836f62e materialize the related entities before joining them 2020-07-10 19:00:44 +02:00
Claudio Atzori b21866a2da allow to set different to relations cut points by source and by target; adjusted weight assigned to relationship types 2020-07-10 13:59:48 +02:00
Claudio Atzori ff4d6214f1 experimenting with pruning of relations 2020-07-10 10:06:41 +02:00
Claudio Atzori b383ed42fa pass optional parameter relationFilter to the PrepareRelationJob implementation 2020-07-07 14:21:28 +02:00
Claudio Atzori d380b85246 unit test for the preparation of the relations 2020-07-02 12:42:13 +02:00
Claudio Atzori 7817338e05 added test to verify the relation pre-processing 2020-06-26 17:58:33 +02:00
Claudio Atzori 8d59fdf34e WIP: dataset based PrepareRelationsJob 2020-06-26 14:32:58 +02:00
Claudio Atzori 216975c4ec restored complete provision workflow 2020-06-25 12:55:52 +02:00
Claudio Atzori 93f627ea51 code formatting 2020-06-25 12:54:21 +02:00
Claudio Atzori e62333192c WIP: prepare relation job 2020-06-25 12:22:18 +02:00
Claudio Atzori 6933ec11fb WIP: prepare relation job 2020-06-25 11:04:12 +02:00
Sandro La Bruzzo a6c0faac70 added test to verify secondary sorting 2020-06-25 10:48:15 +02:00
Claudio Atzori 69b0391708 WIP: prepare relation job 2020-06-25 10:19:56 +02:00
Claudio Atzori 46e76affeb WIP: prepare relation job 2020-06-24 19:01:15 +02:00
Claudio Atzori 0e723d378b added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job 2020-06-24 18:34:42 +02:00
Claudio Atzori 9cd27183b6 [maven-release-plugin] prepare for next development iteration 2020-06-22 11:27:44 +02:00
Claudio Atzori 1e3dab0631 [maven-release-plugin] prepare release dhp-1.2.3 2020-06-22 11:27:39 +02:00
Claudio Atzori c4d9f1837f [maven-release-plugin] prepare for next development iteration 2020-06-12 12:21:08 +02:00
Claudio Atzori f0746a7605 [maven-release-plugin] prepare release dhp-1.2.2 2020-06-12 12:21:03 +02:00
Claudio Atzori 463489f59f code formatting 2020-06-12 12:03:25 +02:00
Claudio Atzori 4bcad1c9c3 Merge branch 'graph_cleaning' 2020-06-12 11:40:25 +02:00
Alessia Bardi e79943965b Fixes #5604: field oamandatepublications in XML 2020-06-11 12:49:31 +02:00
Claudio Atzori 67c7b31ba6 Merge branch 'master' into graph_cleaning 2020-06-10 15:00:35 +02:00
Claudio Atzori ce12f236bb disabled test, need to update the joined_entity.json file 2020-06-09 20:07:36 +02:00
Claudio Atzori a2fdf85ba1 WIP: graph cleaner implementation 2020-06-09 19:52:53 +02:00
Claudio Atzori 05f269a1c0 kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-06-01 00:32:42 +02:00
Claudio Atzori 6f5f498c78 restored common properties driving executor-cores and executor-memory in join_organization_relations wf node 2020-05-29 11:22:00 +02:00
Claudio Atzori b2f9564f13 WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-05-29 10:58:15 +02:00
Claudio Atzori a57965a3ea limiting the dimensions of outliers 2020-05-28 17:36:37 +02:00
Claudio Atzori 821be1f8b6 experimental implementation of custom aggregation using kryo encoders 2020-05-28 13:53:13 +02:00
Claudio Atzori 83504ecace limiting the maximum number of authors allowed in XML records to MAX_AUTHORS = 200; authors with ORCID can exceed that limit 2020-05-28 13:52:30 +02:00
Claudio Atzori ef11593068 JoinedEntity.links defined as empty list by default 2020-05-28 13:50:44 +02:00
Claudio Atzori 5dea155a87 increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase 2020-05-28 13:49:59 +02:00
Claudio Atzori fdd54bad1c code formatting 2020-05-27 19:31:54 +02:00
Claudio Atzori cfd753217c repartition the join_entities in 24k files 2020-05-27 12:44:01 +02:00
Claudio Atzori 2f1a623d09 sync from master branch 2020-05-27 12:39:58 +02:00
Claudio Atzori 9e4ec1543b updated test 2020-05-27 12:38:42 +02:00
Claudio Atzori 8047d16dd9 added RDD based adjacency list creation procedure 2020-05-27 12:38:12 +02:00
Claudio Atzori f057dcdf65 limit the max number of externalreferences to MAX_EXTERNAL_ENTITIES 2020-05-27 12:37:33 +02:00
Claudio Atzori 4e36d689dd fixed XML serialization for children sub-elements (duplicates & externalreferences) 2020-05-26 18:30:40 +02:00
Claudio Atzori b8e541a454 fixing repeated organization.websiteurl in organization entities (#5645) as well as project.ecinternationalorganizationeurinterests 2020-05-26 10:30:09 +02:00
Claudio Atzori 7582532e73 [maven-release-plugin] prepare for next development iteration 2020-05-25 19:48:18 +02:00
Claudio Atzori 01c2e93395 [maven-release-plugin] prepare release dhp-1.2.1 2020-05-25 19:48:14 +02:00
Claudio Atzori 925d933204 making XmlRecordFactory immune to graph encoding changes (mostly to avoid NPEs) 2020-05-22 08:50:44 +02:00
Claudio Atzori b33dd58be4 replaced parameter 'reuseRecords' with 'resumeFrom', allowing to restart the provision workflow execution from any step, useful for manual submissions or debugging 2020-05-22 08:50:06 +02:00
Claudio Atzori dbfb9c19fe minor changes 2020-05-21 10:00:14 +02:00
Claudio Atzori d7d2a0637f added extra parameters to the provision indexing workflow 2020-05-20 14:55:38 +02:00
Claudio Atzori 0bdfbb0a57 reintroduced RDD based relation cut off procedure 2020-05-19 15:02:21 +02:00
Claudio Atzori 60c40618d3 [maven-release-plugin] prepare for next development iteration 2020-05-11 10:17:14 +02:00
Claudio Atzori c267d958d5 [maven-release-plugin] prepare release dhp-1.2.0 2020-05-11 10:17:10 +02:00
Claudio Atzori 42f1a2bf94 bumped project version to 1.2.0-SNAPSHOT 2020-05-11 10:05:57 +02:00
Claudio Atzori 0ccc864ad9 [maven-release-plugin] prepare for next development iteration 2020-05-08 17:01:31 +02:00
Claudio Atzori 6e47c724c6 [maven-release-plugin] prepare release dhp-1.1.7 2020-05-08 17:01:27 +02:00
Claudio Atzori 8c67073a07 force speculative execution to false 2020-05-08 09:42:21 +02:00
Claudio Atzori bac37b3973 fixed children expansion in XML records 2020-05-04 11:51:17 +02:00
Claudio Atzori 077ccd8743 stats wf properties cleanup 2020-05-04 11:41:46 +02:00
Claudio Atzori 439c6255a2 cleanup 2020-04-29 19:09:07 +02:00
Claudio Atzori 6f5b899038 reformatted code according to the updated style descriptor 2020-04-28 11:23:29 +02:00
Claudio Atzori a0bdbacdae switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin 2020-04-27 14:52:31 +02:00
Claudio Atzori 7a3f8085f7 switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin 2020-04-27 14:45:40 +02:00
Claudio Atzori 1e7583c5a6 filtered invisible records in data provision workflow 2020-04-23 07:51:34 +02:00
Claudio Atzori 0b55795d4d small adjustments in the provisioning workflow 2020-04-21 16:15:04 +02:00
Claudio Atzori d772d967aa restored changes from master branch 2020-04-20 18:53:06 +02:00