Compare commits

..

106 Commits

Author SHA1 Message Date
Claudio Atzori a6da42a2e8 Merge pull request 'Update Gtr2 plugin' (#518) from beta-ukripublication into beta
Reviewed-on: #518
2024-12-20 10:11:34 +01:00
Claudio Atzori 7ff4111357 Merge pull request 'ConnectSubCommunities' (#523) from COnnectSubCommunities into beta
Reviewed-on: #523
2024-12-20 10:11:13 +01:00
Miriam Baglioni 849b75593e resolved conflicts 2024-12-20 09:21:22 +01:00
Miriam Baglioni 2d45f125a7 [bulktag subcommunities] refactoring and addition of new properties 2024-12-20 09:06:55 +01:00
Giambattista Bloisi 3ad3a56868 Merge pull request 'Implement new jobs for constructing the graph incrementally' (#522) from incremental_graph into beta
Reviewed-on: #522
2024-12-19 15:14:41 +01:00
Giambattista Bloisi 85dced4ffb Implement new jobs for collecting data from latest graph on hive and deltas from oaf mdstores (datacite and crossref)
Optimized CopyHdfsOafSparkApplication
2024-12-19 14:37:48 +01:00
sandro.labruzzo dccbcfd36c code formatted 2024-12-13 11:48:32 +01:00
sandro.labruzzo b039952d97 bug fixed on zenodo plugin 2024-12-13 10:43:27 +01:00
Miriam Baglioni 29a2a29666 Merge pull request '[research_fi] added plugin name to collectorplugins' (#519) from beta_researchfi into beta
Reviewed-on: #519
2024-12-12 09:12:05 +01:00
Miriam Baglioni 1b1fb9f1c2 [research_fi] added plugin name to collectorplugins 2024-12-11 16:38:02 +01:00
Miriam Baglioni ce22b1d536 [gtr2 plugin] changed to try not to die if one publication link point to the website of the project 2024-12-11 16:33:51 +01:00
Giambattista Bloisi 101d9e830d JsonListMatch do not lower the extracted strings
Fix test configurations and assertions
2024-12-11 15:59:13 +01:00
Miriam Baglioni 19a9bddab1 [gtr2 plugin] changed to try not to die if one publication link point to the website of the project 2024-12-10 16:26:24 +01:00
Miriam Baglioni 69dad7e2bf [gtr2 plugin] removed unused import 2024-12-10 14:17:34 +01:00
Miriam Baglioni 9657707ab0 [gtr2 plugin] changed according to the new apis endpoint and response 2024-12-10 14:15:38 +01:00
Sandro La Bruzzo dd6ed31383 Merge remote-tracking branch 'origin/beta' into beta 2024-12-06 14:23:58 +01:00
Sandro La Bruzzo 0d05006114 code formatted 2024-12-06 14:23:47 +01:00
Claudio Atzori e4b814b3f1 code formatting 2024-12-06 13:58:39 +01:00
Claudio Atzori 5c7f7fb3b8 Merge pull request 'Add Collector Plugin for Zenodo Dumps' (#516) from zenodo_dump_collection into beta
Reviewed-on: #516
2024-12-06 13:51:10 +01:00
Claudio Atzori 9e6b1f2f24 Merge pull request 'Communities_patents' (#514) from Communities_patents into beta
Reviewed-on: #514
2024-12-06 13:50:43 +01:00
Miriam Baglioni 666155bafa [communityfromsemrelpropagation] changed resource to have deletedbyinference = false. 2024-12-06 12:26:41 +01:00
Miriam Baglioni ee84db7a6a [communityfromsemrelpropagation] added filtering to remove the deletedbyinference and invisible results 2024-12-06 12:20:13 +01:00
Claudio Atzori 77308ed525 Merge pull request 'Crossref Enhancements:' (#511) from crossref_mapping_improvement into beta
Reviewed-on: #511
2024-12-06 11:48:57 +01:00
Miriam Baglioni 302c4d044e Merge branch 'beta' into crossref_mapping_improvement 2024-12-06 11:45:37 +01:00
Claudio Atzori 60da306830 Merge pull request 'raid actionset wf' (#517) from raid_actionset into beta
Reviewed-on: #517
2024-12-06 10:04:20 +01:00
Claudio Atzori 8a5ba8df45 minor changes 2024-12-06 10:03:11 +01:00
Claudio Atzori dade7d5bb8 minor changes 2024-12-06 10:02:07 +01:00
Claudio Atzori f57446ad16 merge from beta 2024-12-06 09:50:42 +01:00
Michele De Bonis 1c144a4dcb minor change 2024-12-06 09:18:10 +01:00
Sandro La Bruzzo fd1038b44d removed a sneaky break that was committed by mistake. 2024-12-06 09:12:06 +01:00
Giambattista Bloisi fed13e083e Fix: do not import joda
formatting
2024-12-05 15:21:32 +01:00
Michele De Bonis 6af3fd16b6 attributes fixes 2024-12-05 14:39:42 +01:00
Michele De Bonis bde59a7c8f implementation of the utilities for the inclusion of raids in the graph 2024-12-05 11:09:30 +01:00
sandro.labruzzo 730a7751b6 added zenodoDump to enum of CollectorPlugin 2024-12-04 15:03:59 +01:00
Miriam Baglioni 89b7bc84f2 Merge pull request 'support of the new research.fi apis' (#515) from update_researchfi_plugin into beta
Reviewed-on: #515
2024-12-04 13:53:01 +01:00
sandro.labruzzo 5f134c4045 Merge remote-tracking branch 'origin/beta' into zenodo_dump_collection 2024-12-04 13:41:25 +01:00
sandro.labruzzo 4034da7579 code formatted 2024-12-04 13:37:14 +01:00
sandro.labruzzo 32e2a8b340 implemented zenodo dump collector plugin 2024-12-04 13:36:21 +01:00
Michele Artini 65902a87e3 support of the new apis 2024-12-04 13:18:17 +01:00
sandro.labruzzo cc6bbbb804 make setter void 2024-12-03 14:31:11 +01:00
sandro.labruzzo 0517e452e3 Fixed error on empty affiliation 2024-12-02 14:00:59 +01:00
Miriam Baglioni ca2d480df3 [BulkTagging] added fix to consider when the set of constraints for the datasource is empty. Added check for remove constraints and advanced constraints to verify if the constraints list is empty and in that case do nothing 2024-11-26 15:56:52 +01:00
Claudio Atzori 2e54715d71 Applying PR#512 - Sequential ActionSet promotion 2024-11-26 15:56:46 +01:00
Miriam Baglioni 189a7c255a [patents] added test and resources 2024-11-25 16:52:13 +01:00
Miriam Baglioni 821700299a [patents] added test and resources 2024-11-22 17:21:58 +01:00
Miriam Baglioni 2570023590 [Subcommunities] modified bulktagging workflow to include the new parameters 2024-11-21 14:47:17 +01:00
Miriam Baglioni c0729ac279 [Subcommunities] added remapping to master datasource 2024-11-21 14:36:26 +01:00
Miriam Baglioni ab96983647 [Subcommunities] added remapping to representative organization 2024-11-21 12:35:05 +01:00
Miriam Baglioni 0656ed568d [Subcommunities] remove not needed methods used to create datasourceCommunityMap 2024-11-21 11:05:58 +01:00
Miriam Baglioni ba9f1982b3 [Subcommunities] used the two new access point to directly get the organizationCOmmunityMap and the datasourceCommunityMap 2024-11-21 11:04:58 +01:00
Miriam Baglioni 9ee061ee90 [Subcommunities] added to the list of the communities also the sub community identifiers 2024-11-21 11:02:52 +01:00
Miriam Baglioni e5b04e61ff [CommunityPatents] extends the community propagation considering also the results of type patents linked with a isrelatedto semantcis 2024-11-21 10:20:12 +01:00
Miriam Baglioni 896de42598 [CommunityAPI] use of new access point to directly get the organizationCommunityMap and the datasouceCommunityMap for all the communities and subcommunities. To be changed in the propagation code when implemented in the APIs 2024-11-20 17:44:33 +01:00
Claudio Atzori 15227f82b8 added related author's given name and family name in the solr json payload serialisation 2024-11-20 15:52:40 +01:00
Miriam Baglioni 3081cad1d3 [CommunityAPI] refactoring 2024-11-20 14:03:59 +01:00
Miriam Baglioni 6beb94adee [SubCommunity] Extention of the Utils methods to add also the associations between the subcommunities and organization/project/datasources 2024-11-20 10:59:49 +01:00
sandro.labruzzo ac8995ab64 Merge remote-tracking branch 'origin/beta' into crossref_mapping_improvement 2024-11-20 09:52:51 +01:00
sandro.labruzzo 496007188a Added assertion on CrossrefMappingTest 2024-11-20 09:50:09 +01:00
Miriam Baglioni 9dbcf19efb [SubCommunity] Extention of communityApis to add also the associations between the subcommunities and organization/project/datasources 2024-11-20 09:16:33 +01:00
Claudio Atzori 4e55ddc547 [PubMed aggregation] storing contents into mdStoreVersion/store 2024-11-19 16:50:42 +01:00
Claudio Atzori ef51a60f19 Merge pull request 'dedup_new_comparators' (#509) from dedup_new_comparators into beta
Reviewed-on: #509
2024-11-19 15:13:40 +01:00
Claudio Atzori ff5cb32067 Merge pull request 'abstracts in ODF records from the datacite and the dc nsPrefixes' (#508) from abtracts_guidelines4 into beta
Reviewed-on: #508
2024-11-19 15:12:53 +01:00
Claudio Atzori a48d080e08 Merge pull request 'Improve OAF Generation from Baseline PubMed Collection' (#504) from pubmed_fix into beta
Reviewed-on: #504
2024-11-19 15:12:37 +01:00
Claudio Atzori 5d34432398 align MergeUtils with beta branch 2024-11-19 15:12:04 +01:00
sandro.labruzzo a1297082e2 Crossref Enhancements:
-Accurate Review Type Assignment: Resolved an issue identified in ticket https://support.openaire.eu/issues/9525#note-13. When a relationship of "is-review-of" is detected, the publication type is now correctly set to "Review."
-Enhanced Author Affiliation Data: Implemented Miriam's suggestion by including a new field, "RawAffiliationString," in each author entry. This additional data provides a more granular level of detail regarding author affiliations, potentially improving discoverability and research analysis.
2024-11-19 14:57:18 +01:00
Miriam Baglioni cea2de2c37 [SubCommunity] Extention of CommunityAPIs fro bulk tagging 2024-11-19 14:50:42 +01:00
Michele De Bonis c97facf5e6 conflict resolution in the comparator test class 2024-11-18 14:59:30 +01:00
Claudio Atzori 9e439f5eca map the abstracts considering both the datacite and the dc nsPrefix 2024-11-15 12:19:26 +01:00
Claudio Atzori cf7d9a32ab disable autoBroadcastJoin in the cleaning workflow 2024-11-15 09:17:28 +01:00
Claudio Atzori 5f512f510e code formatting 2024-11-15 09:16:51 +01:00
Claudio Atzori b95672b420 mergeUtils set the result identifier when enforcing the result type 2024-11-15 09:16:18 +01:00
Claudio Atzori 9e8849b753 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-11-13 20:41:51 +01:00
sandro.labruzzo 4778a70478 Merge remote-tracking branch 'origin/beta' into pubmed_fix 2024-11-13 16:28:39 +01:00
Claudio Atzori 4a3b173ca2 defaults to 0000 - Unknown in case the instance type lookup in the dnet:result_typologies doesn't find a corresponding result type binding 2024-11-13 16:27:00 +01:00
sandro.labruzzo ac0a94d62d updated pubmed parser to add also ORCID id and affiliation string to authors 2024-11-13 16:26:59 +01:00
Giambattista Bloisi 5ee8881646 Merge pull request '[danishfunders] added link for danish funders versus the unidentified project for IRFD (501100004836) CF (501100002808) and NNF(501100009708)' (#502) from danishFunders_crossrefmap into beta
Reviewed-on: #502
2024-11-13 12:01:38 +01:00
Miriam Baglioni fb1f0f8850 [danishfunders] added the possibility to link also versus a specif award if present in the metadata 2024-11-13 12:00:33 +01:00
Giambattista Bloisi 5b4d821bf9 Merge pull request 'Crossref: generate canonical openaire id for results in affiliation relationship' (#507) from fix_crossref_affiliations into beta
Reviewed-on: #507
2024-11-13 11:01:37 +01:00
Giambattista Bloisi 03c262ccb9 Crossref: generate canonical openaire id for results in affiliation relationship 2024-11-13 10:56:17 +01:00
sandro.labruzzo a1d5ad5c26 code formatted 2024-11-13 09:51:13 +01:00
sandro.labruzzo b0478c380e merged conflicts on beta 2024-11-13 09:43:16 +01:00
Claudio Atzori 07f267bb10 fix vocabulary lookup in mergeutils 2024-11-13 08:14:26 +01:00
Claudio Atzori 8088943399 Merge pull request 'enforce resulttype' (#506) from merge_resulttypes into beta
Reviewed-on: #506
2024-11-12 14:20:22 +01:00
Claudio Atzori 6c5df761e2 enforce resulttype based on the dnet:result_typologies vocabulary and upon merge 2024-11-12 14:18:04 +01:00
Claudio Atzori 9f7a606ddd Merge pull request 'betaFixPerson' (#505) from betaFixPerson into beta
Reviewed-on: #505
2024-11-12 14:09:22 +01:00
Miriam Baglioni 250f101779 [person] fixed issue in creating project identifier for the graph for person->project relations 2024-11-11 16:04:06 +01:00
Miriam Baglioni f1ea9da5bc [person] checked type in inferenceprovenance 2024-11-11 15:37:56 +01:00
Miriam Baglioni b0283fe94c [person] fix provenance of pid in person when it is orcid (classid entityregistry to avoid the cleaning put orcid_pending) 2024-11-11 14:57:57 +01:00
sandro.labruzzo 474f365286 removed wrong test 2024-11-11 12:37:27 +01:00
sandro.labruzzo 19ce783e58 renamed workflow 2024-11-11 12:28:02 +01:00
Sandro La Bruzzo 0d0904f4ec updated workflow baseline to direct transform on OAF 2024-11-11 10:27:23 +01:00
Giambattista Bloisi f31f22801f Merge pull request 'Remove ORCID information when the same ORCID ID is used multiple times in the same result for different authors' (#503) from clean_clashing_orcids into beta
Reviewed-on: #503
2024-11-08 09:31:11 +01:00
Miriam Baglioni 6fd9ec8566 [danishfunders] added link for danish funders versus the unidentified project for IRFD (501100004836) CF (501100002808) and NNF(501100009708) 2024-11-07 13:55:31 +01:00
Giambattista Bloisi 8f5171557e Remove ORCID information when the same ORCID ID is used multiple times in the same result for different authors 2024-11-07 12:22:34 +01:00
Claudio Atzori f7bb53fe78 [orcid enrichment] added missing workflow parameter: workingDir 2024-11-07 01:04:43 +01:00
Claudio Atzori 973aa7dca6 [dedup] force the Relation schema when reading the merge rels 2024-11-06 12:29:06 +01:00
Sandro La Bruzzo c1cef5d685 removed old library joda time replaced with standard java.time introduced in java 8 2024-11-05 10:38:40 +01:00
Sandro La Bruzzo a8ed5a3b04 Organized getters and setters in the PMArticle class for better readability and maintainability. 2024-11-04 17:45:28 +01:00
Claudio Atzori a42c8b7c85 person table directory produced by the workflows raw_all and merge graphs 2024-10-30 11:25:17 +01:00
Claudio Atzori a877c76d70 make MergeUtils.selectOldestDate less prone to errors when receiving invalid date formats 2024-10-30 11:24:25 +01:00
Claudio Atzori 26cdc7e439 Avoid NPEs in MergeUtils 2024-10-30 07:35:47 +01:00
Claudio Atzori 323c76eafc patch relations job: removed non necessary logging 2024-10-30 07:35:30 +01:00
Michele De Bonis 6c17993d16 Merge branch 'beta' into dedup_new_comparators 2024-10-14 15:24:38 +02:00
Michele De Bonis eab623ddfa implementation of date matcher 2024-10-14 10:24:19 +02:00
Michele De Bonis 5015ba10eb addition of date comparator 2024-10-14 10:23:42 +02:00
Michele De Bonis 62c4c3ed29 implementation of new comparators for organization and dataset disambiguation 2024-10-09 12:26:03 +02:00
119 changed files with 4768 additions and 1110 deletions

1
.gitignore vendored
View File

@ -28,3 +28,4 @@ spark-warehouse
/**/.scalafmt.conf
/.java-version
/dhp-shade-package/dependency-reduced-pom.xml
/**/job.properties

View File

@ -10,6 +10,11 @@ public class Constants {
public static final Map<String, String> accessRightsCoarMap = Maps.newHashMap();
public static final Map<String, String> coarCodeLabelMap = Maps.newHashMap();
public static final String RAID_NS_PREFIX = "raid________";
public static final String END_DATE = "endDate";
public static final String START_DATE = "startDate";
public static final String ROR_NS_PREFIX = "ror_________";
public static final String ROR_OPENAIRE_ID = "10|openaire____::993a7ae7a863813cf95028b50708e222";

View File

@ -2,8 +2,7 @@
package eu.dnetlib.dhp.oa.merge;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;
import static org.apache.spark.sql.functions.*;
import java.util.Map;
import java.util.Optional;
@ -135,7 +134,9 @@ public class GroupEntitiesSparkJob {
.applyCoarVocabularies(entity, vocs),
OAFENTITY_KRYO_ENC)
.groupByKey((MapFunction<OafEntity, String>) OafEntity::getId, Encoders.STRING())
.mapGroups((MapGroupsFunction<String, OafEntity, OafEntity>) MergeUtils::mergeById, OAFENTITY_KRYO_ENC)
.mapGroups(
(MapGroupsFunction<String, OafEntity, OafEntity>) (key, group) -> MergeUtils.mergeById(group, vocs),
OAFENTITY_KRYO_ENC)
.map(
(MapFunction<OafEntity, Tuple2<String, OafEntity>>) t -> new Tuple2<>(
t.getClass().getName(), t),

View File

@ -2,7 +2,6 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static eu.dnetlib.dhp.schema.common.ModelConstants.*;
import static eu.dnetlib.dhp.schema.common.ModelConstants.OPENAIRE_META_RESOURCE_TYPE;
import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.getProvenance;
import java.net.MalformedURLException;
@ -696,6 +695,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
}
}
// set ORCID_PENDING to all orcid values that are not coming from ORCID provenance
for (Author a : r.getAuthor()) {
if (Objects.isNull(a.getPid())) {
a.setPid(Lists.newArrayList());
@ -752,6 +752,40 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.collect(Collectors.toList()));
}
}
// Identify clashing ORCIDS:that is same ORCID associated to multiple authors in this result
Map<String, Integer> clashing_orcid = new HashMap<>();
for (Author a : r.getAuthor()) {
a
.getPid()
.stream()
.filter(
p -> StringUtils
.contains(StringUtils.lowerCase(p.getQualifier().getClassid()), ORCID_PENDING))
.map(StructuredProperty::getValue)
.distinct()
.forEach(orcid -> clashing_orcid.compute(orcid, (k, v) -> (v == null) ? 1 : v + 1));
}
Set<String> clashing = clashing_orcid
.entrySet()
.stream()
.filter(ee -> ee.getValue() > 1)
.map(Map.Entry::getKey)
.collect(Collectors.toSet());
// filter out clashing orcids
for (Author a : r.getAuthor()) {
a
.setPid(
a
.getPid()
.stream()
.filter(p -> !clashing.contains(p.getValue()))
.collect(Collectors.toList()));
}
}
if (value instanceof Publication) {
@ -810,7 +844,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
return author;
}
private static Optional<String> cleanDateField(Field<String> dateofacceptance) {
public static Optional<String> cleanDateField(Field<String> dateofacceptance) {
return Optional
.ofNullable(dateofacceptance)
.map(Field::getValue)

View File

@ -23,24 +23,30 @@ import org.apache.commons.lang3.tuple.Pair;
import com.github.sisyphsu.dateparser.DateParserUtils;
import com.google.common.base.Joiner;
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup;
import eu.dnetlib.dhp.oa.merge.AuthorMerger;
import eu.dnetlib.dhp.schema.common.AccessRightComparator;
import eu.dnetlib.dhp.schema.common.EntityType;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*;
public class MergeUtils {
public static <T extends Oaf> T mergeById(String s, Iterator<T> oafEntityIterator) {
return mergeGroup(s, oafEntityIterator, true);
public static <T extends Oaf> T mergeById(Iterator<T> oafEntityIterator, VocabularyGroup vocs) {
return mergeGroup(oafEntityIterator, true, vocs);
}
public static <T extends Oaf> T mergeGroup(String s, Iterator<T> oafEntityIterator) {
return mergeGroup(s, oafEntityIterator, false);
public static <T extends Oaf> T mergeGroup(Iterator<T> oafEntityIterator) {
return mergeGroup(oafEntityIterator, false);
}
public static <T extends Oaf> T mergeGroup(String s, Iterator<T> oafEntityIterator,
boolean checkDelegateAuthority) {
public static <T extends Oaf> T mergeGroup(Iterator<T> oafEntityIterator, boolean checkDelegateAuthority) {
return mergeGroup(oafEntityIterator, checkDelegateAuthority, null);
}
public static <T extends Oaf> T mergeGroup(Iterator<T> oafEntityIterator,
boolean checkDelegateAuthority, VocabularyGroup vocs) {
ArrayList<T> sortedEntities = new ArrayList<>();
oafEntityIterator.forEachRemaining(sortedEntities::add);
@ -49,13 +55,55 @@ public class MergeUtils {
Iterator<T> it = sortedEntities.iterator();
T merged = it.next();
while (it.hasNext()) {
merged = checkedMerge(merged, it.next(), checkDelegateAuthority);
if (!it.hasNext() && merged instanceof Result && vocs != null) {
return enforceResultType(vocs, (Result) merged);
} else {
while (it.hasNext()) {
merged = checkedMerge(merged, it.next(), checkDelegateAuthority);
}
}
return merged;
}
private static <T extends Oaf> T enforceResultType(VocabularyGroup vocs, Result mergedResult) {
if (Optional.ofNullable(mergedResult.getInstance()).map(List::isEmpty).orElse(true)) {
return (T) mergedResult;
} else {
final Instance i = mergedResult.getInstance().get(0);
if (!vocs.vocabularyExists(ModelConstants.DNET_RESULT_TYPOLOGIES)) {
return (T) mergedResult;
} else {
final String expectedResultType = Optional
.ofNullable(
vocs
.lookupTermBySynonym(
ModelConstants.DNET_RESULT_TYPOLOGIES, i.getInstancetype().getClassid()))
.orElse(ModelConstants.ORP_DEFAULT_RESULTTYPE)
.getClassid();
// there is a clash among the result types
if (!expectedResultType.equals(mergedResult.getResulttype().getClassid())) {
Result result = (Result) Optional
.ofNullable(ModelSupport.oafTypes.get(expectedResultType))
.map(r -> {
try {
return r.newInstance();
} catch (InstantiationException | IllegalAccessException e) {
throw new IllegalStateException(e);
}
})
.orElse(new OtherResearchProduct());
result.setId(mergedResult.getId());
return (T) mergeResultFields(result, mergedResult);
} else {
return (T) mergedResult;
}
}
}
}
public static <T extends Oaf> T checkedMerge(final T left, final T right, boolean checkDelegateAuthority) {
return (T) merge(left, right, checkDelegateAuthority);
}
@ -106,7 +154,7 @@ public class MergeUtils {
return mergeSoftware((Software) left, (Software) right);
}
return mergeResultFields((Result) left, (Result) right);
return left;
} else if (sameClass(left, right, Datasource.class)) {
// TODO
final int trust = compareTrust(left, right);
@ -654,16 +702,9 @@ public class MergeUtils {
}
private static Field<String> selectOldestDate(Field<String> d1, Field<String> d2) {
if (d1 == null || StringUtils.isBlank(d1.getValue())) {
if (!GraphCleaningFunctions.cleanDateField(d1).isPresent()) {
return d2;
} else if (d2 == null || StringUtils.isBlank(d2.getValue())) {
return d1;
}
if (StringUtils.contains(d1.getValue(), "null")) {
return d2;
}
if (StringUtils.contains(d2.getValue(), "null")) {
} else if (!GraphCleaningFunctions.cleanDateField(d2).isPresent()) {
return d1;
}
@ -715,7 +756,11 @@ public class MergeUtils {
private static String spKeyExtractor(StructuredProperty sp) {
return Optional
.ofNullable(sp)
.map(s -> Joiner.on("||").join(qualifierKeyExtractor(s.getQualifier()), s.getValue()))
.map(
s -> Joiner
.on("||")
.useForNull("")
.join(qualifierKeyExtractor(s.getQualifier()), s.getValue()))
.orElse(null);
}

View File

@ -21,7 +21,7 @@ public class CodeMatch extends AbstractStringComparator {
public CodeMatch(Map<String, String> params) {
super(params);
this.params = params;
this.CODE_REGEX = Pattern.compile(params.getOrDefault("codeRegex", "[a-zA-Z]::\\d+"));
this.CODE_REGEX = Pattern.compile(params.getOrDefault("codeRegex", "[a-zA-Z]+::\\d+"));
}
public Set<String> getRegexList(String input) {

View File

@ -0,0 +1,67 @@
package eu.dnetlib.pace.tree;
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.Period;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("dateRange")
public class DateRange extends AbstractStringComparator {
int YEAR_RANGE;
public DateRange(Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler());
YEAR_RANGE = Integer.parseInt(params.getOrDefault("year_range", "3"));
}
public DateRange(final double weight) {
super(weight, new com.wcohen.ss.JaroWinkler());
}
protected DateRange(final double weight, final AbstractStringDistance ssalgo) {
super(weight, ssalgo);
}
public static boolean isNumeric(String str) {
return str.matches("\\d+"); // match a number with optional '-' and decimal.
}
@Override
public double distance(final String a, final String b, final Config conf) {
if (a.isEmpty() || b.isEmpty()) {
return -1.0; // return -1 if a field is missing
}
try {
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd", Locale.ENGLISH);
LocalDate d1 = LocalDate.parse(a, formatter);
LocalDate d2 = LocalDate.parse(b, formatter);
Period period = Period.between(d1, d2);
return period.getYears() <= YEAR_RANGE ? 1.0 : 0.0;
} catch (DateTimeException e) {
return -1.0;
}
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(final double d) {
return d;
}
}

View File

@ -41,21 +41,38 @@ public class JsonListMatch extends AbstractListComparator {
return -1;
}
final Set<String> ca = sa.stream().map(this::toComparableString).collect(Collectors.toSet());
final Set<String> cb = sb.stream().map(this::toComparableString).collect(Collectors.toSet());
Set<String> ca = sa.stream().map(this::toComparableString).collect(Collectors.toSet());
Set<String> cb = sb.stream().map(this::toComparableString).collect(Collectors.toSet());
int incommon = Sets.intersection(ca, cb).size();
int simDiff = Sets.symmetricDifference(ca, cb).size();
switch (MODE) {
case "count":
return Sets.intersection(ca, cb).size();
if (incommon + simDiff == 0) {
return 0.0;
case "percentage":
int incommon = Sets.intersection(ca, cb).size();
int simDiff = Sets.symmetricDifference(ca, cb).size();
if (incommon + simDiff == 0) {
return 0.0;
}
return (double) incommon / (incommon + simDiff);
case "type":
Set<String> typesA = ca.stream().map(s -> s.split("::")[0]).collect(Collectors.toSet());
Set<String> typesB = cb.stream().map(s -> s.split("::")[0]).collect(Collectors.toSet());
Set<String> types = Sets.intersection(typesA, typesB);
if (types.isEmpty()) // if no common type, it is impossible to compare
return -1;
ca = ca.stream().filter(s -> types.contains(s.split("::")[0])).collect(Collectors.toSet());
cb = cb.stream().filter(s -> types.contains(s.split("::")[0])).collect(Collectors.toSet());
return (double) Sets.intersection(ca, cb).size() / types.size();
default:
return -1;
}
if (MODE.equals("percentage"))
return (double) incommon / (incommon + simDiff);
else
return incommon;
}
// converts every json into a comparable string basing on parameters

View File

@ -65,6 +65,43 @@ public class ComparatorTest extends AbstractPaceTest {
}
@Test
public void datasetVersionCodeMatchTest() {
params.put("codeRegex", "(?=[\\w-]*[a-zA-Z])(?=[\\w-]*\\d)[\\w-]+");
CodeMatch codeMatch = new CodeMatch(params);
// names have different codes
assertEquals(
0.0,
codeMatch
.distance(
"physical oceanography at ctd station june 1998 ev02a",
"physical oceanography at ctd station june 1998 ir02", conf));
// names have same code
assertEquals(
1.0,
codeMatch
.distance(
"physical oceanography at ctd station june 1998 ev02a",
"physical oceanography at ctd station june 1998 ev02a", conf));
// code is not in both names
assertEquals(
-1,
codeMatch
.distance(
"physical oceanography at ctd station june 1998",
"physical oceanography at ctd station june 1998 ev02a", conf));
assertEquals(
1.0,
codeMatch
.distance(
"physical oceanography at ctd station june 1998", "physical oceanography at ctd station june 1998",
conf));
}
@Test
public void listContainsMatchTest() {
@ -257,15 +294,15 @@ public class ComparatorTest extends AbstractPaceTest {
List<String> a = createFieldList(
Arrays
.asList(
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":null,\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:actionset\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"doi\",\"classname\":\"Digital Object Identifier\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"10.1111/pbi.12655\"}"),
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":null,\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:actionset\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"grid\",\"classname\":\"GRID Identifier\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"grid_1\"}",
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":null,\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:actionset\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"ror\",\"classname\":\"Research Organization Registry\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"ror_1\"}"),
"authors");
List<String> b = createFieldList(
Arrays
.asList(
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:crosswalk:repository\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"pmc\",\"classname\":\"PubMed Central ID\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"PMC5399005\"}",
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:crosswalk:repository\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"pmid\",\"classname\":\"PubMed ID\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"27775869\"}",
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"user:claim\",\"classname\":\"Linked by user\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"doi\",\"classname\":\"Digital Object Identifier\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"10.1111/pbi.12655\"}",
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:crosswalk:repository\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"handle\",\"classname\":\"Handle\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"1854/LU-8523529\"}"),
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:crosswalk:repository\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"grid\",\"classname\":\"GRID Identifier\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"grid_1\"}",
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:crosswalk:repository\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"ror\",\"classname\":\"Research Organization Registry\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"ror_2\"}",
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"user:claim\",\"classname\":\"Linked by user\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"isni\",\"classname\":\"ISNI Identifier\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"isni_1\"}"),
"authors");
double result = jsonListMatch.compare(a, b, conf);
@ -277,6 +314,13 @@ public class ComparatorTest extends AbstractPaceTest {
result = jsonListMatch.compare(a, b, conf);
assertEquals(1.0, result);
params.put("mode", "type");
jsonListMatch = new JsonListMatch(params);
result = jsonListMatch.compare(a, b, conf);
assertEquals(0.5, result);
}
@Test
@ -327,6 +371,24 @@ public class ComparatorTest extends AbstractPaceTest {
}
@Test
public void dateMatch() {
DateRange dateRange = new DateRange(params);
double result = dateRange.distance("2021-05-13", "2023-05-13", conf);
assertEquals(1.0, result);
result = dateRange.distance("2021-05-13", "2025-05-13", conf);
assertEquals(0.0, result);
result = dateRange.distance("", "2020-05-05", conf);
assertEquals(-1.0, result);
result = dateRange.distance("invalid date", "2021-05-02", conf);
assertEquals(-1.0, result);
}
@Test
public void titleVersionMatchTest() {

View File

@ -26,16 +26,16 @@
<dependencies>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-actionmanager</artifactId>
<version>${project.version}</version>
</dependency>
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-aggregation</artifactId>-->
<!-- <artifactId>dhp-actionmanager</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-aggregation</artifactId>
<version>${project.version}</version>
</dependency>
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-blacklist</artifactId>-->
@ -56,61 +56,61 @@
<!-- <artifactId>dhp-enrichment</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-graph-mapper</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-graph-provision</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-impact-indicators</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-stats-actionsets</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-stats-hist-snaps</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-stats-monitor-irish</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-stats-promote</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-stats-update</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-swh</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-usage-raw-data-update</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-usage-stats-build</artifactId>
<version>${project.version}</version>
</dependency>
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-graph-mapper</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-graph-provision</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-impact-indicators</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-stats-actionsets</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-stats-hist-snaps</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-stats-monitor-irish</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-stats-promote</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-stats-update</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-swh</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-usage-raw-data-update</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
<!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-usage-stats-build</artifactId>-->
<!-- <version>${project.version}</version>-->
<!-- </dependency>-->
</dependencies>

View File

@ -135,22 +135,10 @@
<arg>--outputPath</arg><arg>${workingDir}/action_payload_by_type</arg>
<arg>--isLookupUrl</arg><arg>${isLookupUrl}</arg>
</spark>
<ok to="ForkPromote"/>
<ok to="PromoteActionPayloadForDatasetTable"/>
<error to="Kill"/>
</action>
<fork name="ForkPromote">
<path start="PromoteActionPayloadForDatasetTable"/>
<path start="PromoteActionPayloadForDatasourceTable"/>
<path start="PromoteActionPayloadForOrganizationTable"/>
<path start="PromoteActionPayloadForOtherResearchProductTable"/>
<path start="PromoteActionPayloadForProjectTable"/>
<path start="PromoteActionPayloadForPublicationTable"/>
<path start="PromoteActionPayloadForRelationTable"/>
<path start="PromoteActionPayloadForSoftwareTable"/>
<path start="PromoteActionPayloadForPersonTable"/>
</fork>
<action name="PromoteActionPayloadForDatasetTable">
<sub-workflow>
<app-path>${wf:appPath()}/promote_action_payload_for_dataset_table</app-path>
@ -162,7 +150,7 @@
</property>
</configuration>
</sub-workflow>
<ok to="JoinPromote"/>
<ok to="PromoteActionPayloadForDatasourceTable"/>
<error to="Kill"/>
</action>
@ -177,7 +165,7 @@
</property>
</configuration>
</sub-workflow>
<ok to="JoinPromote"/>
<ok to="PromoteActionPayloadForOrganizationTable"/>
<error to="Kill"/>
</action>
@ -192,7 +180,7 @@
</property>
</configuration>
</sub-workflow>
<ok to="JoinPromote"/>
<ok to="PromoteActionPayloadForOtherResearchProductTable"/>
<error to="Kill"/>
</action>
@ -207,7 +195,7 @@
</property>
</configuration>
</sub-workflow>
<ok to="JoinPromote"/>
<ok to="PromoteActionPayloadForProjectTable"/>
<error to="Kill"/>
</action>
@ -222,7 +210,7 @@
</property>
</configuration>
</sub-workflow>
<ok to="JoinPromote"/>
<ok to="PromoteActionPayloadForPublicationTable"/>
<error to="Kill"/>
</action>
@ -237,7 +225,7 @@
</property>
</configuration>
</sub-workflow>
<ok to="JoinPromote"/>
<ok to="PromoteActionPayloadForRelationTable"/>
<error to="Kill"/>
</action>
@ -252,7 +240,7 @@
</property>
</configuration>
</sub-workflow>
<ok to="JoinPromote"/>
<ok to="PromoteActionPayloadForSoftwareTable"/>
<error to="Kill"/>
</action>
@ -267,26 +255,9 @@
</property>
</configuration>
</sub-workflow>
<ok to="JoinPromote"/>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="PromoteActionPayloadForPersonTable">
<sub-workflow>
<app-path>${wf:appPath()}/promote_action_payload_for_person_table</app-path>
<propagate-configuration/>
<configuration>
<property>
<name>inputActionPayloadRootPath</name>
<value>${workingDir}/action_payload_by_type</value>
</property>
</configuration>
</sub-workflow>
<ok to="JoinPromote"/>
<error to="Kill"/>
</action>
<join name="JoinPromote" to="End"/>
<end name="End"/>
</workflow-app>

View File

@ -13,6 +13,8 @@ import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.Instance;
import eu.dnetlib.dhp.schema.oaf.Qualifier;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.dhp.schema.oaf.Subject;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;

View File

@ -104,22 +104,22 @@ public class PrepareAffiliationRelations implements Serializable {
.listKeyValues(OPENAIRE_DATASOURCE_ID, OPENAIRE_DATASOURCE_NAME);
JavaPairRDD<Text, Text> crossrefRelations = prepareAffiliationRelationsNewModel(
spark, crossrefInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + "::crossref");
spark, crossrefInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":crossref");
JavaPairRDD<Text, Text> pubmedRelations = prepareAffiliationRelations(
spark, pubmedInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + "::pubmed");
spark, pubmedInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":pubmed");
JavaPairRDD<Text, Text> openAPCRelations = prepareAffiliationRelationsNewModel(
spark, openapcInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + "::openapc");
spark, openapcInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":openapc");
JavaPairRDD<Text, Text> dataciteRelations = prepareAffiliationRelationsNewModel(
spark, dataciteInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + "::datacite");
spark, dataciteInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":datacite");
JavaPairRDD<Text, Text> webCrawlRelations = prepareAffiliationRelationsNewModel(
spark, webcrawlInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + "::rawaff");
spark, webcrawlInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":rawaff");
JavaPairRDD<Text, Text> publisherRelations = prepareAffiliationRelationFromPublisherNewModel(
spark, publisherlInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + "::webcrawl");
spark, publisherlInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":webcrawl");
crossrefRelations
.union(pubmedRelations)

View File

@ -15,6 +15,7 @@ import java.util.stream.Collectors;
import org.apache.commons.cli.ParseException;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
@ -29,7 +30,6 @@ import org.apache.spark.sql.Dataset;
import org.jetbrains.annotations.NotNull;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.spark_project.jetty.util.StringUtil;
import com.fasterxml.jackson.databind.ObjectMapper;
@ -193,8 +193,8 @@ public class ExtractPerson implements Serializable {
private static Relation getProjectRelation(String project, String orcid, String role) {
String source = PERSON_PREFIX + "::" + IdentifierFactory.md5(orcid);
String target = PROJECT_ID_PREFIX + project.substring(0, 14)
+ IdentifierFactory.md5(project.substring(15));
String target = PROJECT_ID_PREFIX + StringUtils.substringBefore(project, "::") + "::"
+ IdentifierFactory.md5(StringUtils.substringAfter(project, "::"));
List<KeyValue> properties = new ArrayList<>();
Relation relation = OafMapperUtils
@ -206,7 +206,7 @@ public class ExtractPerson implements Serializable {
null);
relation.setValidated(true);
if (StringUtil.isNotBlank(role)) {
if (StringUtils.isNotBlank(role)) {
KeyValue kv = new KeyValue();
kv.setKey("role");
kv.setValue(role);
@ -345,7 +345,20 @@ public class ExtractPerson implements Serializable {
OafMapperUtils
.structuredProperty(
op.getOrcid(), ModelConstants.ORCID, ModelConstants.ORCID_CLASSNAME,
ModelConstants.DNET_PID_TYPES, ModelConstants.DNET_PID_TYPES, null));
ModelConstants.DNET_PID_TYPES, ModelConstants.DNET_PID_TYPES,
OafMapperUtils
.dataInfo(
false,
null,
false,
false,
OafMapperUtils
.qualifier(
ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY,
ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY,
ModelConstants.DNET_PID_TYPES,
ModelConstants.DNET_PID_TYPES),
"0.91")));
person.setDateofcollection(op.getLastModifiedDate());
person.setOriginalId(Arrays.asList(op.getOrcid()));
person.setDataInfo(ORCIDDATAINFO);
@ -439,13 +452,13 @@ public class ExtractPerson implements Serializable {
null);
relation.setValidated(true);
if (Optional.ofNullable(row.getStartDate()).isPresent() && StringUtil.isNotBlank(row.getStartDate())) {
if (Optional.ofNullable(row.getStartDate()).isPresent() && StringUtils.isNotBlank(row.getStartDate())) {
KeyValue kv = new KeyValue();
kv.setKey("startDate");
kv.setValue(row.getStartDate());
properties.add(kv);
}
if (Optional.ofNullable(row.getEndDate()).isPresent() && StringUtil.isNotBlank(row.getEndDate())) {
if (Optional.ofNullable(row.getEndDate()).isPresent() && StringUtils.isNotBlank(row.getEndDate())) {
KeyValue kv = new KeyValue();
kv.setKey("endDate");
kv.setValue(row.getEndDate());

View File

@ -0,0 +1,203 @@
package eu.dnetlib.dhp.actionmanager.raid;
import static eu.dnetlib.dhp.actionmanager.personentity.ExtractPerson.OPENAIRE_DATASOURCE_ID;
import static eu.dnetlib.dhp.actionmanager.personentity.ExtractPerson.OPENAIRE_DATASOURCE_NAME;
import static eu.dnetlib.dhp.common.Constants.*;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import static eu.dnetlib.dhp.schema.common.ModelConstants.*;
import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.*;
import java.util.*;
import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.raid.model.RAiDEntity;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import eu.dnetlib.dhp.utils.DHPUtils;
import scala.Tuple2;
public class GenerateRAiDActionSetJob {
private static final Logger log = LoggerFactory
.getLogger(eu.dnetlib.dhp.actionmanager.raid.GenerateRAiDActionSetJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final List<KeyValue> RAID_COLLECTED_FROM = listKeyValues(
OPENAIRE_DATASOURCE_ID, OPENAIRE_DATASOURCE_NAME);
private static final Qualifier RAID_QUALIFIER = qualifier(
"0049", "Research Activity Identifier", DNET_PUBLICATION_RESOURCE, DNET_PUBLICATION_RESOURCE);
private static final Qualifier RAID_INFERENCE_QUALIFIER = qualifier(
"raid:openaireinference", "Inferred by OpenAIRE", DNET_PROVENANCE_ACTIONS, DNET_PROVENANCE_ACTIONS);
private static final DataInfo RAID_DATA_INFO = dataInfo(
false, OPENAIRE_DATASOURCE_NAME, true, false, RAID_INFERENCE_QUALIFIER, "0.92");
public static void main(final String[] args) throws Exception {
final String jsonConfiguration = IOUtils
.toString(
eu.dnetlib.dhp.actionmanager.raid.GenerateRAiDActionSetJob.class
.getResourceAsStream("/eu/dnetlib/dhp/actionmanager/raid/action_set_parameters.json"));
final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);
parser.parseArgument(args);
final Boolean isSparkSessionManaged = Optional
.ofNullable(parser.get("isSparkSessionManaged"))
.map(Boolean::valueOf)
.orElse(Boolean.TRUE);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
final String inputPath = parser.get("inputPath");
log.info("inputPath: {}", inputPath);
final String outputPath = parser.get("outputPath");
log.info("outputPath {}: ", outputPath);
final SparkConf conf = new SparkConf();
runWithSparkSession(conf, isSparkSessionManaged, spark -> {
removeOutputDir(spark, outputPath);
processRAiDEntities(spark, inputPath, outputPath);
});
}
private static void removeOutputDir(final SparkSession spark, final String path) {
HdfsSupport.remove(path, spark.sparkContext().hadoopConfiguration());
}
static void processRAiDEntities(final SparkSession spark,
final String inputPath,
final String outputPath) {
readInputPath(spark, inputPath)
.map(GenerateRAiDActionSetJob::prepareRAiD)
.flatMap(List::iterator)
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);
}
protected static List<AtomicAction<? extends Oaf>> prepareRAiD(final RAiDEntity r) {
final Date now = new Date();
final OtherResearchProduct orp = new OtherResearchProduct();
final List<AtomicAction<? extends Oaf>> res = new ArrayList<>();
String raidId = calculateOpenaireId(r.getRaid());
orp.setId(raidId);
orp.setCollectedfrom(RAID_COLLECTED_FROM);
orp.setDataInfo(RAID_DATA_INFO);
orp
.setTitle(
Collections
.singletonList(
structuredProperty(
r.getTitle(),
qualifier("main title", "main title", DNET_DATACITE_TITLE, DNET_DATACITE_TITLE),
RAID_DATA_INFO)));
orp.setDescription(listFields(RAID_DATA_INFO, r.getSummary()));
Instance instance = new Instance();
instance.setInstancetype(RAID_QUALIFIER);
orp.setInstance(Collections.singletonList(instance));
orp
.setSubject(
r
.getSubjects()
.stream()
.map(
s -> subject(
s,
qualifier(
DNET_SUBJECT_KEYWORD, DNET_SUBJECT_KEYWORD, DNET_SUBJECT_TYPOLOGIES,
DNET_SUBJECT_TYPOLOGIES),
RAID_DATA_INFO))
.collect(Collectors.toList()));
orp
.setRelevantdate(
Arrays
.asList(
structuredProperty(
r.getEndDate(), qualifier(END_DATE, END_DATE, DNET_DATACITE_DATE, DNET_DATACITE_DATE),
RAID_DATA_INFO),
structuredProperty(
r.getStartDate(),
qualifier(START_DATE, START_DATE, DNET_DATACITE_DATE, DNET_DATACITE_DATE),
RAID_DATA_INFO)));
orp.setLastupdatetimestamp(now.getTime());
orp.setDateofacceptance(field(r.getStartDate(), RAID_DATA_INFO));
res.add(new AtomicAction<>(OtherResearchProduct.class, orp));
for (String resultId : r.getIds()) {
Relation rel1 = OafMapperUtils
.getRelation(
raidId,
resultId,
ModelConstants.RESULT_RESULT,
PART,
HAS_PART,
orp);
Relation rel2 = OafMapperUtils
.getRelation(
resultId,
raidId,
ModelConstants.RESULT_RESULT,
PART,
IS_PART_OF,
orp);
res.add(new AtomicAction<>(Relation.class, rel1));
res.add(new AtomicAction<>(Relation.class, rel2));
}
return res;
}
public static String calculateOpenaireId(final String raid) {
return String.format("50|%s::%s", RAID_NS_PREFIX, DHPUtils.md5(raid));
}
public static List<Author> createAuthors(final List<String> author) {
return author.stream().map(s -> {
Author a = new Author();
a.setFullname(s);
return a;
}).collect(Collectors.toList());
}
private static JavaRDD<RAiDEntity> readInputPath(
final SparkSession spark,
final String path) {
return spark
.read()
.json(path)
.as(Encoders.bean(RAiDEntity.class))
.toJavaRDD();
}
}

View File

@ -0,0 +1,5 @@
package eu.dnetlib.dhp.actionmanager.raid.model;
public class GenerateRAiDActionSetJob {
}

View File

@ -0,0 +1,106 @@
package eu.dnetlib.dhp.actionmanager.raid.model;
import java.io.Serializable;
import java.util.List;
public class RAiDEntity implements Serializable {
String raid;
List<String> authors;
String startDate;
String endDate;
List<String> subjects;
List<String> titles;
List<String> ids;
String title;
String summary;
public RAiDEntity() {
}
public RAiDEntity(String raid, List<String> authors, String startDate, String endDate, List<String> subjects,
List<String> titles, List<String> ids, String title, String summary) {
this.raid = raid;
this.authors = authors;
this.startDate = startDate;
this.endDate = endDate;
this.subjects = subjects;
this.titles = titles;
this.ids = ids;
this.title = title;
this.summary = summary;
}
public String getRaid() {
return raid;
}
public void setRaid(String raid) {
this.raid = raid;
}
public List<String> getAuthors() {
return authors;
}
public void setAuthors(List<String> authors) {
this.authors = authors;
}
public String getStartDate() {
return startDate;
}
public void setStartDate(String startDate) {
this.startDate = startDate;
}
public String getEndDate() {
return endDate;
}
public void setEndDate(String endDate) {
this.endDate = endDate;
}
public List<String> getSubjects() {
return subjects;
}
public void setSubjects(List<String> subjects) {
this.subjects = subjects;
}
public List<String> getTitles() {
return titles;
}
public void setTitles(List<String> titles) {
this.titles = titles;
}
public List<String> getIds() {
return ids;
}
public void setIds(List<String> ids) {
this.ids = ids;
}
public String getTitle() {
return title;
}
public void setTitle(String title) {
this.title = title;
}
public String getSummary() {
return summary;
}
public void setSummary(String summary) {
this.summary = summary;
}
}

View File

@ -44,13 +44,7 @@ import eu.dnetlib.dhp.common.Constants;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.DataInfo;
import eu.dnetlib.dhp.schema.oaf.Field;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Oaf;
import eu.dnetlib.dhp.schema.oaf.Organization;
import eu.dnetlib.dhp.schema.oaf.Qualifier;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.utils.DHPUtils;
import scala.Tuple2;

View File

@ -28,6 +28,7 @@ import eu.dnetlib.dhp.collection.plugin.mongodb.MongoDbDumpCollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.oai.OaiCollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.osf.OsfPreprintsCollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.rest.RestCollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.zenodo.CollectZenodoDumpCollectorPlugin;
import eu.dnetlib.dhp.common.aggregation.AggregatorReport;
import eu.dnetlib.dhp.common.collection.CollectorException;
import eu.dnetlib.dhp.common.collection.HttpClientParams;
@ -129,6 +130,8 @@ public class CollectorWorker extends ReportingJob {
return new Gtr2PublicationsCollectorPlugin(this.clientParams);
case osfPreprints:
return new OsfPreprintsCollectorPlugin(this.clientParams);
case zenodoDump:
return new CollectZenodoDumpCollectorPlugin();
case other:
final CollectorPlugin.NAME.OTHER_NAME plugin = Optional
.ofNullable(this.api.getParams().get("other_plugin_type"))

View File

@ -154,7 +154,6 @@ public class ORCIDExtractor extends Thread {
extractedItem++;
if (extractedItem % 100000 == 0) {
log.info("Thread {}: Extracted {} items", id, extractedItem);
break;
}
}
}

View File

@ -11,7 +11,7 @@ public interface CollectorPlugin {
enum NAME {
oai, other, rest_json2xml, file, fileGzip, baseDump, gtr2Publications, osfPreprints;
oai, other, rest_json2xml, file, fileGzip, baseDump, gtr2Publications, osfPreprints, zenodoDump, research_fi;
public enum OTHER_NAME {
mdstore_mongodb_dump, mdstore_mongodb

View File

@ -1,6 +1,8 @@
package eu.dnetlib.dhp.collection.plugin.gtr2;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
@ -8,17 +10,19 @@ import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.math.NumberUtils;
import org.apache.http.Header;
import org.apache.http.HttpHeaders;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -28,12 +32,10 @@ import eu.dnetlib.dhp.common.collection.HttpConnector2;
public class Gtr2PublicationsIterator implements Iterator<String> {
public static final int PAGE_SIZE = 20;
private static final Logger log = LoggerFactory.getLogger(Gtr2PublicationsIterator.class);
private final HttpConnector2 connector;
private static final DateTimeFormatter simpleDateTimeFormatter = DateTimeFormat.forPattern("yyyy-MM-dd");
private static final DateTimeFormatter simpleDateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
private static final int MAX_ATTEMPTS = 10;
@ -41,8 +43,7 @@ public class Gtr2PublicationsIterator implements Iterator<String> {
private int currPage;
private int endPage;
private boolean incremental = false;
private DateTime fromDate;
private LocalDate fromDate;
private final Map<String, String> cache = new HashMap<>();
private final Queue<String> queue = new LinkedList<>();
@ -88,7 +89,7 @@ public class Gtr2PublicationsIterator implements Iterator<String> {
private void prepareNextElement() {
while ((this.currPage <= this.endPage) && this.queue.isEmpty()) {
log.debug("FETCHING PAGE + " + this.currPage + "/" + this.endPage);
log.info("FETCHING PAGE + " + this.currPage + "/" + this.endPage);
this.queue.addAll(fetchPage(this.currPage++));
}
this.nextElement = this.queue.poll();
@ -97,18 +98,17 @@ public class Gtr2PublicationsIterator implements Iterator<String> {
private List<String> fetchPage(final int pageNumber) {
final List<String> res = new ArrayList<>();
try {
final Document doc = loadURL(cleanURL(this.baseUrl + "/outcomes/publications?p=" + pageNumber), 0);
if (this.endPage == Integer.MAX_VALUE) {
this.endPage = NumberUtils.toInt(doc.valueOf("/*/@*[local-name() = 'totalPages']"));
}
try {
final Document doc = loadURL(this.baseUrl + "/publication?page=" + pageNumber, 0);
for (final Object po : doc.selectNodes("//*[local-name() = 'publication']")) {
final Element mainEntity = (Element) ((Element) po).detach();
if (filterIncremental(mainEntity)) {
res.add(expandMainEntity(mainEntity));
final String publicationOverview = mainEntity.attributeValue("url");
res.add(loadURL(publicationOverview, -1).asXML());
} else {
log.debug("Skipped entity");
}
@ -122,34 +122,6 @@ public class Gtr2PublicationsIterator implements Iterator<String> {
return res;
}
private void addLinkedEntities(final Element master, final String relType, final Element newRoot,
final Function<Document, Element> mapper) {
for (final Object o : master.selectNodes(".//*[local-name()='link']")) {
final String rel = ((Element) o).valueOf("@*[local-name()='rel']");
final String href = ((Element) o).valueOf("@*[local-name()='href']");
if (relType.equals(rel) && StringUtils.isNotBlank(href)) {
final String cacheKey = relType + "#" + href;
if (this.cache.containsKey(cacheKey)) {
try {
log.debug(" * from cache (" + relType + "): " + href);
newRoot.add(DocumentHelper.parseText(this.cache.get(cacheKey)).getRootElement());
} catch (final DocumentException e) {
log.error("Error retrieving cache element: " + cacheKey, e);
throw new RuntimeException("Error retrieving cache element: " + cacheKey, e);
}
} else {
final Document doc = loadURL(cleanURL(href), 0);
final Element elem = mapper.apply(doc);
newRoot.add(elem);
this.cache.put(cacheKey, elem.asXML());
}
}
}
}
private boolean filterIncremental(final Element e) {
if (!this.incremental || isAfter(e.valueOf("@*[local-name() = 'created']"), this.fromDate)
|| isAfter(e.valueOf("@*[local-name() = 'updated']"), this.fromDate)) {
@ -158,58 +130,52 @@ public class Gtr2PublicationsIterator implements Iterator<String> {
return false;
}
private String expandMainEntity(final Element mainEntity) {
final Element newRoot = DocumentHelper.createElement("doc");
newRoot.add(mainEntity);
addLinkedEntities(mainEntity, "PROJECT", newRoot, this::asProjectElement);
return DocumentHelper.createDocument(newRoot).asXML();
}
private Element asProjectElement(final Document doc) {
final Element newOrg = DocumentHelper.createElement("project");
newOrg.addElement("id").setText(doc.valueOf("/*/@*[local-name()='id']"));
newOrg
.addElement("code")
.setText(doc.valueOf("//*[local-name()='identifier' and @*[local-name()='type'] = 'RCUK']"));
newOrg.addElement("title").setText(doc.valueOf("//*[local-name()='title']"));
return newOrg;
}
private static String cleanURL(final String url) {
String cleaned = url;
if (cleaned.contains("gtr.gtr")) {
cleaned = cleaned.replace("gtr.gtr", "gtr");
}
if (cleaned.startsWith("http://")) {
cleaned = cleaned.replaceFirst("http://", "https://");
}
return cleaned;
}
private Document loadURL(final String cleanUrl, final int attempt) {
try {
log.debug(" * Downloading Url: " + cleanUrl);
final byte[] bytes = this.connector.getInputSource(cleanUrl).getBytes("UTF-8");
return DocumentHelper.parseText(new String(bytes));
try (final CloseableHttpClient client = HttpClients.createDefault()) {
final HttpGet req = new HttpGet(cleanUrl);
req.setHeader(HttpHeaders.ACCEPT, "application/xml");
try (final CloseableHttpResponse response = client.execute(req)) {
if (endPage == Integer.MAX_VALUE)
for (final Header header : response.getAllHeaders()) {
log.debug("HEADER: " + header.getName() + " = " + header.getValue());
if ("Link-Pages".equals(header.getName())) {
if (Integer.parseInt(header.getValue()) < endPage)
endPage = Integer.parseInt(header.getValue());
}
}
final String content = IOUtils.toString(response.getEntity().getContent());
return DocumentHelper.parseText(content);
}
} catch (final Throwable e) {
log.error("Error dowloading url: " + cleanUrl + ", attempt = " + attempt, e);
if (attempt == -1)
try {
return DocumentHelper.parseText("<empty></empty>");
} catch (Throwable t) {
throw new RuntimeException();
}
log.error("Error dowloading url: {}, attempt = {}", cleanUrl, attempt, e);
if (attempt >= MAX_ATTEMPTS) {
throw new RuntimeException("Error dowloading url: " + cleanUrl, e);
throw new RuntimeException("Error downloading url: " + cleanUrl, e);
}
try {
Thread.sleep(60000); // I wait for a minute
} catch (final InterruptedException e1) {
throw new RuntimeException("Error dowloading url: " + cleanUrl, e);
throw new RuntimeException("Error downloading url: " + cleanUrl, e);
}
return loadURL(cleanUrl, attempt + 1);
}
}
private DateTime parseDate(final String s) {
return DateTime.parse(s.contains("T") ? s.substring(0, s.indexOf("T")) : s, simpleDateTimeFormatter);
private LocalDate parseDate(final String s) {
return LocalDate.parse(s.contains("T") ? s.substring(0, s.indexOf("T")) : s, simpleDateTimeFormatter);
}
private boolean isAfter(final String d, final DateTime fromDate) {
private boolean isAfter(final String d, final LocalDate fromDate) {
return StringUtils.isNotBlank(d) && parseDate(d).isAfter(fromDate);
}
}

View File

@ -6,7 +6,7 @@ import java.util.Queue;
import java.util.concurrent.PriorityBlockingQueue;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.math.NumberUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.Header;
@ -27,25 +27,25 @@ public class ResearchFiIterator implements Iterator<String> {
private final String baseUrl;
private final String authToken;
private int currPage;
private int nPages;
private String nextUrl;
private int nCalls = 0;
private final Queue<String> queue = new PriorityBlockingQueue<>();
public ResearchFiIterator(final String baseUrl, final String authToken) {
this.baseUrl = baseUrl;
this.authToken = authToken;
this.currPage = 0;
this.nPages = 0;
this.nextUrl = null;
}
private void verifyStarted() {
if (this.currPage == 0) {
try {
nextCall();
} catch (final CollectorException e) {
throw new IllegalStateException(e);
try {
if (this.nCalls == 0) {
this.nextUrl = invokeUrl(this.baseUrl);
}
} catch (final CollectorException e) {
throw new IllegalStateException(e);
}
}
@ -62,9 +62,9 @@ public class ResearchFiIterator implements Iterator<String> {
synchronized (this.queue) {
verifyStarted();
final String res = this.queue.poll();
while (this.queue.isEmpty() && (this.currPage < this.nPages)) {
while (this.queue.isEmpty() && StringUtils.isNotBlank(this.nextUrl)) {
try {
nextCall();
this.nextUrl = invokeUrl(this.nextUrl);
} catch (final CollectorException e) {
throw new IllegalStateException(e);
}
@ -73,18 +73,11 @@ public class ResearchFiIterator implements Iterator<String> {
}
}
private void nextCall() throws CollectorException {
private String invokeUrl(final String url) throws CollectorException {
this.currPage += 1;
this.nCalls += 1;
String next = null;
final String url;
if (!this.baseUrl.contains("?")) {
url = String.format("%s?PageNumber=%d&PageSize=%d", this.baseUrl, this.currPage, PAGE_SIZE);
} else if (!this.baseUrl.contains("PageSize=")) {
url = String.format("%s&PageNumber=%d&PageSize=%d", this.baseUrl, this.currPage, PAGE_SIZE);
} else {
url = String.format("%s&PageNumber=%d", this.baseUrl, this.currPage);
}
log.info("Calling url: " + url);
try (final CloseableHttpClient client = HttpClients.createDefault()) {
@ -94,11 +87,15 @@ public class ResearchFiIterator implements Iterator<String> {
try (final CloseableHttpResponse response = client.execute(req)) {
for (final Header header : response.getAllHeaders()) {
log.debug("HEADER: " + header.getName() + " = " + header.getValue());
if ("x-page-count".equals(header.getName())) {
final int totalPages = NumberUtils.toInt(header.getValue());
if (this.nPages != totalPages) {
this.nPages = NumberUtils.toInt(header.getValue());
log.info("Total pages: " + totalPages);
if ("link".equals(header.getName())) {
final String s = StringUtils.substringBetween(header.getValue(), "<", ">");
final String token = StringUtils
.substringBefore(StringUtils.substringAfter(s, "NextPageToken="), "&");
if (this.baseUrl.contains("?")) {
next = this.baseUrl + "&NextPageToken=" + token;
} else {
next = this.baseUrl + "?NextPageToken=" + token;
}
}
}
@ -108,6 +105,9 @@ public class ResearchFiIterator implements Iterator<String> {
jsonArray.forEach(obj -> this.queue.add(JsonUtils.convertToXML(obj.toString())));
}
return next;
} catch (final Throwable e) {
log.warn("Error calling url: " + url, e);
throw new CollectorException("Error calling url: " + url, e);

View File

@ -0,0 +1,109 @@
package eu.dnetlib.dhp.collection.plugin.zenodo;
import static eu.dnetlib.dhp.utils.DHPUtils.getHadoopConfiguration;
import java.io.IOException;
import java.io.InputStream;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.collection.ApiDescriptor;
import eu.dnetlib.dhp.collection.plugin.CollectorPlugin;
import eu.dnetlib.dhp.common.aggregation.AggregatorReport;
import eu.dnetlib.dhp.common.collection.CollectorException;
public class CollectZenodoDumpCollectorPlugin implements CollectorPlugin {
final private Logger log = LoggerFactory.getLogger(getClass());
private void downloadItem(final String name, final String itemURL, final String basePath,
final FileSystem fileSystem) {
try {
final Path hdfsWritePath = new Path(String.format("%s/%s", basePath, name));
final FSDataOutputStream fsDataOutputStream = fileSystem.create(hdfsWritePath, true);
final HttpGet request = new HttpGet(itemURL);
final int timeout = 60; // seconds
final RequestConfig config = RequestConfig
.custom()
.setConnectTimeout(timeout * 1000)
.setConnectionRequestTimeout(timeout * 1000)
.setSocketTimeout(timeout * 1000)
.build();
log.info("Downloading url {} into {}", itemURL, hdfsWritePath.getName());
try (CloseableHttpClient client = HttpClientBuilder.create().setDefaultRequestConfig(config).build();
CloseableHttpResponse response = client.execute(request)) {
int responseCode = response.getStatusLine().getStatusCode();
log.info("Response code is {}", responseCode);
if (responseCode >= 200 && responseCode < 400) {
IOUtils.copy(response.getEntity().getContent(), fsDataOutputStream);
fsDataOutputStream.flush();
fsDataOutputStream.hflush();
fsDataOutputStream.close();
}
} catch (Throwable eu) {
throw new RuntimeException(eu);
}
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
public FileSystem initializeFileSystem(final String hdfsURI) {
try {
return FileSystem.get(getHadoopConfiguration(hdfsURI));
} catch (IOException e) {
throw new RuntimeException(e);
}
}
@Override
public Stream<String> collect(ApiDescriptor api, AggregatorReport report) throws CollectorException {
final String zenodoURL = api.getBaseUrl();
final String hdfsURI = api.getParams().get("hdfsURI");
final FileSystem fileSystem = initializeFileSystem(hdfsURI);
return doStream(fileSystem, zenodoURL, "/tmp");
}
public Stream<String> doStream(FileSystem fileSystem, String zenodoURL, String basePath) throws CollectorException {
try {
downloadItem("zenodoDump.tar.gz", zenodoURL, basePath, fileSystem);
CompressionCodecFactory factory = new CompressionCodecFactory(fileSystem.getConf());
Path sourcePath = new Path(basePath + "/zenodoDump.tar.gz");
CompressionCodec codec = factory.getCodec(sourcePath);
InputStream gzipInputStream = null;
try {
gzipInputStream = codec.createInputStream(fileSystem.open(sourcePath));
return iterateTar(gzipInputStream);
} catch (IOException e) {
throw new CollectorException(e);
}
} catch (Exception e) {
throw new CollectorException(e);
}
}
private Stream<String> iterateTar(InputStream gzipInputStream) throws Exception {
Iterable<String> iterable = () -> new ZenodoTarIterator(gzipInputStream);
return StreamSupport.stream(iterable.spliterator(), false);
}
}

View File

@ -0,0 +1,59 @@
package eu.dnetlib.dhp.collection.plugin.zenodo;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Iterator;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.io.IOUtils;
public class ZenodoTarIterator implements Iterator<String>, Closeable {
private final InputStream gzipInputStream;
private final StringBuilder currentItem = new StringBuilder();
private TarArchiveInputStream tais;
private boolean hasNext;
public ZenodoTarIterator(InputStream gzipInputStream) {
this.gzipInputStream = gzipInputStream;
tais = new TarArchiveInputStream(gzipInputStream);
hasNext = getNextItem();
}
private boolean getNextItem() {
try {
TarArchiveEntry entry;
while ((entry = tais.getNextTarEntry()) != null) {
if (entry.isFile()) {
currentItem.setLength(0);
currentItem.append(IOUtils.toString(new InputStreamReader(tais)));
return true;
}
}
return false;
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
@Override
public boolean hasNext() {
return hasNext;
}
@Override
public String next() {
final String data = currentItem.toString();
hasNext = getNextItem();
return data;
}
@Override
public void close() throws IOException {
gzipInputStream.close();
}
}

View File

@ -0,0 +1,39 @@
package eu.dnetlib.dhp.sx.bio.pubmed;
/**
* The type Pubmed Affiliation.
*
* @author Sandro La Bruzzo
*/
public class PMAffiliation {
private String name;
private PMIdentifier identifier;
public PMAffiliation() {
}
public PMAffiliation(String name, PMIdentifier identifier) {
this.name = name;
this.identifier = identifier;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public PMIdentifier getIdentifier() {
return identifier;
}
public void setIdentifier(PMIdentifier identifier) {
this.identifier = identifier;
}
}

View File

@ -8,259 +8,115 @@ import java.util.List;
/**
* This class represent an instance of Pubmed Article extracted from the native XML
*
* @author Sandro La Bruzzo
*/
public class PMArticle implements Serializable {
/**
* the Pubmed Identifier
*/
private String pmid;
private String pmcId;
/**
* the DOI
*/
private String doi;
/**
* the Pubmed Date extracted from <PubmedPubDate> Specifies a date significant to either the article's history or the citation's processing.
* All <History> dates will have a <Year>, <Month>, and <Day> elements. Some may have an <Hour>, <Minute>, and <Second> element(s).
*/
private String date;
/**
* This is an 'envelop' element that contains various elements describing the journal cited; i.e., ISSN, Volume, Issue, and PubDate and author name(s), however, it does not contain data itself.
*/
private PMJournal journal;
/**
* The full journal title (taken from NLM cataloging data following NLM rules for how to compile a serial name) is exported in this element. Some characters that are not part of the NLM MEDLINE/PubMed Character Set reside in a relatively small number of full journal titles. The NLM journal title abbreviation is exported in the <MedlineTA> element.
*/
private String title;
/**
* English-language abstracts are taken directly from the published article.
* If the article does not have a published abstract, the National Library of Medicine does not create one,
* thus the record lacks the <Abstract> and <AbstractText> elements. However, in the absence of a formally
* labeled abstract in the published article, text from a substantive "summary", "summary and conclusions" or "conclusions and summary" may be used.
*/
private String description;
/**
* the language in which an article was published is recorded in <Language>.
* All entries are three letter abbreviations stored in lower case, such as eng, fre, ger, jpn, etc. When a single
* record contains more than one language value the XML export program extracts the languages in alphabetic order by the 3-letter language value.
* Some records provided by collaborating data producers may contain the value und to identify articles whose language is undetermined.
*/
private String language;
/**
* NLM controlled vocabulary, Medical Subject Headings (MeSH®), is used to characterize the content of the articles represented by MEDLINE citations. *
*/
private final List<PMSubject> subjects = new ArrayList<>();
/**
* This element is used to identify the type of article indexed for MEDLINE;
* it characterizes the nature of the information or the manner in which it is conveyed as well as the type of
* research support received (e.g., Review, Letter, Retracted Publication, Clinical Conference, Research Support, N.I.H., Extramural).
*/
private final List<PMSubject> publicationTypes = new ArrayList<>();
/**
* Personal and collective (corporate) author names published with the article are found in <AuthorList>.
*/
private List<PMSubject> subjects;
private List<PMSubject> publicationTypes = new ArrayList<>();
private List<PMAuthor> authors = new ArrayList<>();
private List<PMGrant> grants = new ArrayList<>();
/**
* <GrantID> contains the research grant or contract number (or both) that designates financial support by any agency of the United States Public Health Service
* or any institute of the National Institutes of Health. Additionally, beginning in late 2005, grant numbers are included for many other US and non-US funding agencies and organizations.
*/
private final List<PMGrant> grants = new ArrayList<>();
/**
* get the DOI
* @return a DOI
*/
public String getDoi() {
return doi;
}
/**
* Set the DOI
* @param doi a DOI
*/
public void setDoi(String doi) {
this.doi = doi;
}
/**
* get the Pubmed Identifier
* @return the PMID
*/
public String getPmid() {
return pmid;
}
/**
* set the Pubmed Identifier
* @param pmid the Pubmed Identifier
*/
public void setPmid(String pmid) {
this.pmid = pmid;
}
/**
* the Pubmed Date extracted from <PubmedPubDate> Specifies a date significant to either the article's history or the citation's processing.
* All <History> dates will have a <Year>, <Month>, and <Day> elements. Some may have an <Hour>, <Minute>, and <Second> element(s).
*
* @return the Pubmed Date
*/
public String getDate() {
return date;
}
/**
* Set the pubmed Date
* @param date
*/
public void setDate(String date) {
this.date = date;
}
/**
* The full journal title (taken from NLM cataloging data following NLM rules for how to compile a serial name) is exported in this element.
* Some characters that are not part of the NLM MEDLINE/PubMed Character Set reside in a relatively small number of full journal titles.
* The NLM journal title abbreviation is exported in the <MedlineTA> element.
*
* @return the pubmed Journal Extracted
*/
public PMJournal getJournal() {
return journal;
}
/**
* Set the mapped pubmed Journal
* @param journal
*/
public void setJournal(PMJournal journal) {
this.journal = journal;
}
/**
* <ArticleTitle> contains the entire title of the journal article. <ArticleTitle> is always in English;
* those titles originally published in a non-English language and translated for <ArticleTitle> are enclosed in square brackets.
* All titles end with a period unless another punctuation mark such as a question mark or bracket is present.
* Explanatory information about the title itself is enclosed in parentheses, e.g.: (author's transl).
* Corporate/collective authors may appear at the end of <ArticleTitle> for citations up to about the year 2000.
*
* @return the extracted pubmed Title
*/
public String getTitle() {
return title;
}
/**
* set the pubmed title
* @param title
*/
public void setTitle(String title) {
this.title = title;
}
/**
* English-language abstracts are taken directly from the published article.
* If the article does not have a published abstract, the National Library of Medicine does not create one,
* thus the record lacks the <Abstract> and <AbstractText> elements. However, in the absence of a formally
* labeled abstract in the published article, text from a substantive "summary", "summary and conclusions" or "conclusions and summary" may be used.
*
* @return the Mapped Pubmed Article Abstracts
*/
public String getDescription() {
return description;
}
/**
* Set the Mapped Pubmed Article Abstracts
* @param description
*/
public void setDescription(String description) {
this.description = description;
}
/**
* Personal and collective (corporate) author names published with the article are found in <AuthorList>.
*
* @return get the Mapped Authors lists
*/
public List<PMAuthor> getAuthors() {
return authors;
}
/**
* Set the Mapped Authors lists
* @param authors
*/
public void setAuthors(List<PMAuthor> authors) {
this.authors = authors;
}
/**
* This element is used to identify the type of article indexed for MEDLINE;
* it characterizes the nature of the information or the manner in which it is conveyed as well as the type of
* research support received (e.g., Review, Letter, Retracted Publication, Clinical Conference, Research Support, N.I.H., Extramural).
*
* @return the mapped Subjects
*/
public List<PMSubject> getSubjects() {
return subjects;
}
/**
*
* the language in which an article was published is recorded in <Language>.
* All entries are three letter abbreviations stored in lower case, such as eng, fre, ger, jpn, etc. When a single
* record contains more than one language value the XML export program extracts the languages in alphabetic order by the 3-letter language value.
* Some records provided by collaborating data producers may contain the value und to identify articles whose language is undetermined.
*
* @return The mapped Language
*/
public String getLanguage() {
return language;
}
/**
*
* Set The mapped Language
*
* @param language the mapped Language
*/
public void setLanguage(String language) {
this.language = language;
}
/**
* This element is used to identify the type of article indexed for MEDLINE;
* it characterizes the nature of the information or the manner in which it is conveyed as well as the type of
* research support received (e.g., Review, Letter, Retracted Publication, Clinical Conference, Research Support, N.I.H., Extramural).
*
* @return the mapped Publication Type
*/
public List<PMSubject> getPublicationTypes() {
return publicationTypes;
}
/**
* <GrantID> contains the research grant or contract number (or both) that designates financial support by any agency of the United States Public Health Service
* or any institute of the National Institutes of Health. Additionally, beginning in late 2005, grant numbers are included for many other US and non-US funding agencies and organizations.
* @return the mapped grants
*/
public List<PMGrant> getGrants() {
return grants;
}
public String getPmcId() {
return pmcId;
}
public PMArticle setPmcId(String pmcId) {
public void setPmcId(String pmcId) {
this.pmcId = pmcId;
return this;
}
public String getDoi() {
return doi;
}
public void setDoi(String doi) {
this.doi = doi;
}
public String getDate() {
return date;
}
public void setDate(String date) {
this.date = date;
}
public PMJournal getJournal() {
return journal;
}
public void setJournal(PMJournal journal) {
this.journal = journal;
}
public String getTitle() {
return title;
}
public void setTitle(String title) {
this.title = title;
}
public String getDescription() {
return description;
}
public void setDescription(String description) {
this.description = description;
}
public String getLanguage() {
return language;
}
public void setLanguage(String language) {
this.language = language;
}
public List<PMSubject> getSubjects() {
return subjects;
}
public void setSubjects(List<PMSubject> subjects) {
this.subjects = subjects;
}
public List<PMSubject> getPublicationTypes() {
return publicationTypes;
}
public void setPublicationTypes(List<PMSubject> publicationTypes) {
this.publicationTypes = publicationTypes;
}
public List<PMAuthor> getAuthors() {
return authors;
}
public void setAuthors(List<PMAuthor> authors) {
this.authors = authors;
}
public List<PMGrant> getGrants() {
return grants;
}
public void setGrants(List<PMGrant> grants) {
this.grants = grants;
}
}

View File

@ -12,6 +12,8 @@ public class PMAuthor implements Serializable {
private String lastName;
private String foreName;
private PMIdentifier identifier;
private PMAffiliation affiliation;
/**
* Gets last name.
@ -59,4 +61,40 @@ public class PMAuthor implements Serializable {
.format("%s, %s", this.foreName != null ? this.foreName : "", this.lastName != null ? this.lastName : "");
}
/**
* Gets identifier.
*
* @return the identifier
*/
public PMIdentifier getIdentifier() {
return identifier;
}
/**
* Sets identifier.
*
* @param identifier the identifier
*/
public void setIdentifier(PMIdentifier identifier) {
this.identifier = identifier;
}
/**
* Gets affiliation.
*
* @return the affiliation
*/
public PMAffiliation getAffiliation() {
return affiliation;
}
/**
* Sets affiliation.
*
* @param affiliation the affiliation
*/
public void setAffiliation(PMAffiliation affiliation) {
this.affiliation = affiliation;
}
}

View File

@ -0,0 +1,53 @@
package eu.dnetlib.dhp.sx.bio.pubmed;
public class PMIdentifier {
private String pid;
private String type;
public PMIdentifier(String pid, String type) {
this.pid = cleanPid(pid);
this.type = type;
}
public PMIdentifier() {
}
private String cleanPid(String pid) {
if (pid == null) {
return null;
}
// clean ORCID ID in the form 0000000163025705 to 0000-0001-6302-5705
if (pid.matches("[0-9]{15}[0-9X]")) {
return pid.replaceAll("(.{4})(.{4})(.{4})(.{4})", "$1-$2-$3-$4");
}
// clean ORCID in the form http://orcid.org/0000-0001-8567-3543 to 0000-0001-8567-3543
if (pid.matches("http://orcid.org/[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}")) {
return pid.replaceAll("http://orcid.org/", "");
}
return pid;
}
public String getPid() {
return pid;
}
public PMIdentifier setPid(String pid) {
this.pid = cleanPid(pid);
return this;
}
public String getType() {
return type;
}
public PMIdentifier setType(String type) {
this.type = type;
return this;
}
}

View File

@ -0,0 +1,14 @@
[
{
"paramName": "i",
"paramLongName": "inputPath",
"paramDescription": "the path of the input json",
"paramRequired": true
},
{
"paramName": "o",
"paramLongName": "outputPath",
"paramDescription": "the path of the new ActionSet",
"paramRequired": true
}
]

View File

@ -0,0 +1,58 @@
<configuration>
<property>
<name>jobTracker</name>
<value>yarnRM</value>
</property>
<property>
<name>nameNode</name>
<value>hdfs://nameservice1</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
<property>
<name>oozie.action.sharelib.for.spark</name>
<value>spark2</value>
</property>
<property>
<name>hive_metastore_uris</name>
<value>thrift://iis-cdh5-test-m3.ocean.icm.edu.pl:9083</value>
</property>
<property>
<name>spark2YarnHistoryServerAddress</name>
<value>http://iis-cdh5-test-gw.ocean.icm.edu.pl:18089</value>
</property>
<property>
<name>spark2ExtraListeners</name>
<value>com.cloudera.spark.lineage.NavigatorAppListener</value>
</property>
<property>
<name>spark2SqlQueryExecutionListeners</name>
<value>com.cloudera.spark.lineage.NavigatorQueryListener</value>
</property>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
<property>
<name>sparkExecutorNumber</name>
<value>4</value>
</property>
<property>
<name>spark2EventLogDir</name>
<value>/user/spark/spark2ApplicationHistory</value>
</property>
<property>
<name>sparkDriverMemory</name>
<value>15G</value>
</property>
<property>
<name>sparkExecutorMemory</name>
<value>6G</value>
</property>
<property>
<name>sparkExecutorCores</name>
<value>1</value>
</property>
</configuration>

View File

@ -0,0 +1,53 @@
<workflow-app name="Update_RAiD_action_set" xmlns="uri:oozie:workflow:0.5">
<parameters>
<property>
<name>raidJsonInputPath</name>
<description>the path of the json</description>
</property>
<property>
<name>raidActionSetPath</name>
<description>path where to store the action set</description>
</property>
</parameters>
<start to="deleteoutputpath"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="deleteoutputpath">
<fs>
<delete path='${raidActionSetPath}'/>
<mkdir path='${raidActionSetPath}'/>
</fs>
<ok to="processRAiDFile"/>
<error to="Kill"/>
</action>
<action name="processRAiDFile">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>ProcessRAiDFile</name>
<class>eu.dnetlib.dhp.actionmanager.raid.GenerateRAiDActionSetJob</class>
<jar>dhp-aggregation-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=3840
</spark-opts>
<arg>--inputPath</arg><arg>${raidJsonInputPath}</arg>
<arg>--outputPath</arg><arg>${raidActionSetPath}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

View File

@ -1,8 +1,7 @@
[
{"paramName":"mt", "paramLongName":"master", "paramDescription": "should be local or yarn", "paramRequired": true},
{"paramName":"i", "paramLongName":"isLookupUrl", "paramDescription": "isLookupUrl", "paramRequired": true},
{"paramName":"w", "paramLongName":"workingPath", "paramDescription": "the path of the sequencial file to read", "paramRequired": true},
{"paramName":"mo", "paramLongName":"mdstoreOutputVersion", "paramDescription": "the oaf path ", "paramRequired": true},
{"paramName":"s", "paramLongName":"skipUpdate", "paramDescription": "skip update ", "paramRequired": false},
{"paramName":"h", "paramLongName":"hdfsServerUri", "paramDescription": "the working path ", "paramRequired": true}
{"paramName":"mt", "paramLongName":"master", "paramDescription": "should be local or yarn", "paramRequired": true},
{"paramName":"i", "paramLongName":"isLookupUrl", "paramDescription": "isLookupUrl", "paramRequired": true},
{"paramName":"s", "paramLongName":"sourcePath", "paramDescription": "the baseline path", "paramRequired": true},
{"paramName":"mo", "paramLongName":"mdstoreOutputVersion", "paramDescription": "the mdstore path to save", "paramRequired": true}
]

View File

@ -1,4 +1,4 @@
<workflow-app name="Download_Transform_Pubmed_Workflow" xmlns="uri:oozie:workflow:0.5">
<workflow-app name="Transform_Pubmed_Workflow" xmlns="uri:oozie:workflow:0.5">
<parameters>
<property>
<name>baselineWorkingPath</name>
@ -16,11 +16,6 @@
<name>mdStoreManagerURI</name>
<description>the path of the cleaned mdstore</description>
</property>
<property>
<name>skipUpdate</name>
<value>false</value>
<description>The request block size</description>
</property>
</parameters>
<start to="StartTransaction"/>
@ -44,16 +39,16 @@
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
<capture-output/>
</java>
<ok to="ConvertDataset"/>
<ok to="TransformPubMed"/>
<error to="RollBack"/>
</action>
<action name="ConvertDataset">
<action name="TransformPubMed">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Convert Baseline to OAF Dataset</name>
<class>eu.dnetlib.dhp.sx.bio.ebi.SparkCreateBaselineDataFrame</class>
<name>Convert Baseline Pubmed to OAF Dataset</name>
<class>eu.dnetlib.dhp.sx.bio.ebi.SparkCreatePubmedDump</class>
<jar>dhp-aggregation-${projectVersion}.jar</jar>
<spark-opts>
--executor-memory=${sparkExecutorMemory}
@ -65,12 +60,10 @@
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
</spark-opts>
<arg>--workingPath</arg><arg>${baselineWorkingPath}</arg>
<arg>--sourcePath</arg><arg>${baselineWorkingPath}</arg>
<arg>--mdstoreOutputVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--master</arg><arg>yarn</arg>
<arg>--isLookupUrl</arg><arg>${isLookupUrl}</arg>
<arg>--hdfsServerUri</arg><arg>${nameNode}</arg>
<arg>--skipUpdate</arg><arg>${skipUpdate}</arg>
</spark>
<ok to="CommitVersion"/>
<error to="RollBack"/>

View File

@ -37,7 +37,7 @@ case class mappingAuthor(
family: Option[String],
sequence: Option[String],
ORCID: Option[String],
affiliation: Option[mappingAffiliation]
affiliation: Option[List[mappingAffiliation]]
) {}
case class funderInfo(id: String, uri: String, name: String, synonym: List[String]) {}
@ -457,15 +457,14 @@ case object Crossref2Oaf {
}
//Mapping Author
val authorList: List[mappingAuthor] =
(json \ "author").extract[List[mappingAuthor]].filter(a => a.family.isDefined)
val authorList: List[mappingAuthor] = (json \ "author").extract[List[mappingAuthor]].filter(a => a.family.isDefined)
val sorted_list = authorList.sortWith((a: mappingAuthor, b: mappingAuthor) =>
a.sequence.isDefined && a.sequence.get.equalsIgnoreCase("first")
)
result.setAuthor(sorted_list.zipWithIndex.map { case (a, index) =>
generateAuhtor(a.given.orNull, a.family.get, a.ORCID.orNull, index)
generateAuthor(a.given.orNull, a.family.get, a.ORCID.orNull, index, a.affiliation)
}.asJava)
// Mapping instance
@ -504,19 +503,6 @@ case object Crossref2Oaf {
)
}
val is_review = json \ "relation" \ "is-review-of" \ "id"
if (is_review != JNothing) {
instance.setInstancetype(
OafMapperUtils.qualifier(
"0015",
"peerReviewed",
ModelConstants.DNET_REVIEW_LEVELS,
ModelConstants.DNET_REVIEW_LEVELS
)
)
}
if (doi.startsWith("10.3410") || doi.startsWith("10.12703"))
instance.setHostedby(
OafMapperUtils.keyValue(OafMapperUtils.createOpenaireId(10, "openaire____::H1Connect", true), "H1Connect")
@ -574,12 +560,23 @@ case object Crossref2Oaf {
s"50|doiboost____|$id"
}
def generateAuhtor(given: String, family: String, orcid: String, index: Int): Author = {
private def generateAuthor(
given: String,
family: String,
orcid: String,
index: Int,
affiliation: Option[List[mappingAffiliation]]
): Author = {
val a = new Author
a.setName(given)
a.setSurname(family)
a.setFullname(s"$given $family")
a.setRank(index + 1)
// Adding Raw affiliation if it's defined
if (affiliation.isDefined) {
a.setRawAffiliationString(affiliation.get.map(a => a.name).asJava)
}
if (StringUtils.isNotBlank(orcid))
a.setPid(
List(
@ -673,11 +670,11 @@ case object Crossref2Oaf {
val doi = input.getString(0)
val rorId = input.getString(1)
val pubId = s"50|${PidType.doi.toString.padTo(12, "_")}::${DoiCleaningRule.clean(doi)}"
val pubId = IdentifierFactory.idFromPid("50", "doi", DoiCleaningRule.clean(doi), true)
val affId = GenerateRorActionSetJob.calculateOpenaireId(rorId)
val r: Relation = new Relation
DoiCleaningRule.clean(doi)
r.setSource(pubId)
r.setTarget(affId)
r.setRelType(ModelConstants.RESULT_ORGANIZATION)
@ -705,7 +702,15 @@ case object Crossref2Oaf {
val objectType = (json \ "type").extractOrElse[String](null)
if (objectType == null)
return resultList
val typology = getTypeQualifier(objectType, vocabularies)
// If the item has a relations is-review-of, then we force it to a peer-review
val is_review = json \ "relation" \ "is-review-of" \ "id"
var force_to_review = false
if (is_review != JNothing) {
force_to_review = true
}
val typology = getTypeQualifier(if (force_to_review) "peer-review" else objectType, vocabularies)
if (typology == null)
return List()
@ -757,33 +762,6 @@ case object Crossref2Oaf {
else
resultList
}
// if (uw != null) {
// result.getCollectedfrom.add(createUnpayWallCollectedFrom())
// val i: Instance = new Instance()
// i.setCollectedfrom(createUnpayWallCollectedFrom())
// if (uw.best_oa_location != null) {
//
// i.setUrl(List(uw.best_oa_location.url).asJava)
// if (uw.best_oa_location.license.isDefined) {
// i.setLicense(field[String](uw.best_oa_location.license.get, null))
// }
//
// val colour = get_unpaywall_color(uw.oa_status)
// if (colour.isDefined) {
// val a = new AccessRight
// a.setClassid(ModelConstants.ACCESS_RIGHT_OPEN)
// a.setClassname(ModelConstants.ACCESS_RIGHT_OPEN)
// a.setSchemeid(ModelConstants.DNET_ACCESS_MODES)
// a.setSchemename(ModelConstants.DNET_ACCESS_MODES)
// a.setOpenAccessRoute(colour.get)
// i.setAccessright(a)
// }
// i.setPid(result.getPid)
// result.getInstance().add(i)
// }
// }
}
private def createCiteRelation(source: Result, targetPid: String, targetPidType: String): List[Relation] = {
@ -978,7 +956,26 @@ case object Crossref2Oaf {
case "10.13039/501100010790" =>
generateSimpleRelationFromAward(funder, "erasmusplus_", a => a)
case _ => logger.debug("no match for " + funder.DOI.get)
//Add for Danish funders
//Independent Research Fund Denmark (IRFD)
case "10.13039/501100004836" =>
generateSimpleRelationFromAward(funder, "irfd________", a => a)
val targetId = getProjectId("irfd________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
//Carlsberg Foundation (CF)
case "10.13039/501100002808" =>
generateSimpleRelationFromAward(funder, "cf__________", a => a)
val targetId = getProjectId("cf__________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
//Novo Nordisk Foundation (NNF)
case "10.13039/501100009708" =>
generateSimpleRelationFromAward(funder, "nnf___________", a => a)
val targetId = getProjectId("nnf_________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
case _ => logger.debug("no match for " + funder.DOI.get)
}
} else {

View File

@ -0,0 +1,104 @@
package eu.dnetlib.dhp.sx.bio.ebi
import com.fasterxml.jackson.databind.ObjectMapper
import eu.dnetlib.dhp.application.AbstractScalaApplication
import eu.dnetlib.dhp.common.Constants
import eu.dnetlib.dhp.common.Constants.{MDSTORE_DATA_PATH, MDSTORE_SIZE_PATH}
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup
import eu.dnetlib.dhp.schema.mdstore.MDStoreVersion
import eu.dnetlib.dhp.sx.bio.pubmed.{PMArticle, PMParser2, PubMedToOaf}
import eu.dnetlib.dhp.transformation.TransformSparkJobNode
import eu.dnetlib.dhp.utils.DHPUtils.writeHdfsFile
import eu.dnetlib.dhp.utils.ISLookupClientFactory
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.slf4j.{Logger, LoggerFactory}
class SparkCreatePubmedDump(propertyPath: String, args: Array[String], log: Logger)
extends AbstractScalaApplication(propertyPath, args, log: Logger) {
/** Here all the spark applications runs this method
* where the whole logic of the spark node is defined
*/
override def run(): Unit = {
val isLookupUrl: String = parser.get("isLookupUrl")
log.info("isLookupUrl: {}", isLookupUrl)
val sourcePath = parser.get("sourcePath")
log.info(s"SourcePath is '$sourcePath'")
val mdstoreOutputVersion = parser.get("mdstoreOutputVersion")
log.info(s"mdstoreOutputVersion is '$mdstoreOutputVersion'")
val mapper = new ObjectMapper()
val cleanedMdStoreVersion = mapper.readValue(mdstoreOutputVersion, classOf[MDStoreVersion])
val outputBasePath = cleanedMdStoreVersion.getHdfsPath
log.info(s"outputBasePath is '$outputBasePath'")
val isLookupService = ISLookupClientFactory.getLookUpService(isLookupUrl)
val vocabularies = VocabularyGroup.loadVocsFromIS(isLookupService)
createPubmedDump(spark, sourcePath, outputBasePath, vocabularies)
}
/** This method creates a dump of the pubmed articles
* @param spark the spark session
* @param sourcePath the path of the source file
* @param targetPath the path of the target file
* @param vocabularies the vocabularies
*/
def createPubmedDump(
spark: SparkSession,
sourcePath: String,
targetPath: String,
vocabularies: VocabularyGroup
): Unit = {
require(spark != null)
implicit val PMEncoder: Encoder[PMArticle] = Encoders.bean(classOf[PMArticle])
import spark.implicits._
val df = spark.read.option("lineSep", "</PubmedArticle>").text(sourcePath)
val mapper = new ObjectMapper()
df.as[String]
.map(s => {
val id = s.indexOf("<PubmedArticle>")
if (id >= 0) s"${s.substring(id)}</PubmedArticle>" else null
})
.filter(s => s != null)
.map { i =>
//remove try catch
try {
new PMParser2().parse(i)
} catch {
case _: Exception => {
throw new RuntimeException(s"Error parsing article: $i")
}
}
}
.dropDuplicates("pmid")
.map { a =>
val oaf = PubMedToOaf.convert(a, vocabularies)
if (oaf != null)
mapper.writeValueAsString(oaf)
else
null
}
.as[String]
.filter(s => s != null)
.write
.option("compression", "gzip")
.mode("overwrite")
.text(targetPath + MDSTORE_DATA_PATH)
val mdStoreSize = spark.read.text(targetPath + MDSTORE_DATA_PATH).count
writeHdfsFile(spark.sparkContext.hadoopConfiguration, "" + mdStoreSize, targetPath + MDSTORE_SIZE_PATH)
}
}
object SparkCreatePubmedDump {
def main(args: Array[String]): Unit = {
val log: Logger = LoggerFactory.getLogger(getClass)
new SparkCreatePubmedDump("/eu/dnetlib/dhp/sx/bio/ebi/baseline_to_oaf_params.json", args, log).initialize().run()
}
}

View File

@ -0,0 +1,277 @@
package eu.dnetlib.dhp.sx.bio.pubmed
import org.apache.commons.lang3.StringUtils
import javax.xml.stream.XMLEventReader
import scala.collection.JavaConverters._
import scala.xml.{MetaData, NodeSeq}
import scala.xml.pull.{EvElemEnd, EvElemStart, EvText}
class PMParser2 {
/** Extracts the value of an attribute from a MetaData object.
* @param attrs the MetaData object
* @param key the key of the attribute
* @return the value of the attribute or null if the attribute is not found
*/
private def extractAttributes(attrs: MetaData, key: String): String = {
val res = attrs.get(key)
if (res.isDefined) {
val s = res.get
if (s != null && s.nonEmpty)
s.head.text
else
null
} else null
}
/** Validates and formats a date given the year, month, and day as strings.
*
* @param year the year as a string
* @param month the month as a string
* @param day the day as a string
* @return the formatted date as "YYYY-MM-DD" or null if the date is invalid
*/
private def validate_Date(year: String, month: String, day: String): String = {
try {
f"${year.toInt}-${month.toInt}%02d-${day.toInt}%02d"
} catch {
case _: Throwable => null
}
}
/** Extracts the grant information from a NodeSeq object.
*
* @param gNode the NodeSeq object
* @return the grant information or an empty list if the grant information is not found
*/
private def extractGrant(gNode: NodeSeq): List[PMGrant] = {
gNode
.map(node => {
val grantId = (node \ "GrantID").text
val agency = (node \ "Agency").text
val country = (node \ "Country").text
new PMGrant(grantId, agency, country)
})
.toList
}
/** Extracts the journal information from a NodeSeq object.
*
* @param jNode the NodeSeq object
* @return the journal information or null if the journal information is not found
*/
private def extractJournal(jNode: NodeSeq): PMJournal = {
val journal = new PMJournal
journal.setTitle((jNode \ "Title").text)
journal.setIssn((jNode \ "ISSN").text)
journal.setVolume((jNode \ "JournalIssue" \ "Volume").text)
journal.setIssue((jNode \ "JournalIssue" \ "Issue").text)
if (journal.getTitle != null && StringUtils.isNotEmpty(journal.getTitle))
journal
else
null
}
private def extractAuthors(aNode: NodeSeq): List[PMAuthor] = {
aNode
.map(author => {
val a = new PMAuthor
a.setLastName((author \ "LastName").text)
a.setForeName((author \ "ForeName").text)
val id = (author \ "Identifier").text
val idType = (author \ "Identifier" \ "@Source").text
if (id != null && id.nonEmpty && idType != null && idType.nonEmpty) {
a.setIdentifier(new PMIdentifier(id, idType))
}
val affiliation = (author \ "AffiliationInfo" \ "Affiliation").text
val affiliationId = (author \ "AffiliationInfo" \ "Identifier").text
val affiliationIdType = (author \ "AffiliationInfo" \ "Identifier" \ "@Source").text
if (affiliation != null && affiliation.nonEmpty) {
val aff = new PMAffiliation()
aff.setName(affiliation)
if (
affiliationId != null && affiliationId.nonEmpty && affiliationIdType != null && affiliationIdType.nonEmpty
) {
aff.setIdentifier(new PMIdentifier(affiliationId, affiliationIdType))
}
a.setAffiliation(aff)
}
a
})
.toList
}
def parse(input: String): PMArticle = {
val xml = scala.xml.XML.loadString(input)
val article = new PMArticle
val grantNodes = xml \ "MedlineCitation" \\ "Grant"
article.setGrants(extractGrant(grantNodes).asJava)
val journal = xml \ "MedlineCitation" \ "Article" \ "Journal"
article.setJournal(extractJournal(journal))
val authors = xml \ "MedlineCitation" \ "Article" \ "AuthorList" \ "Author"
article.setAuthors(
extractAuthors(authors).asJava
)
val pmId = xml \ "MedlineCitation" \ "PMID"
val articleIds = xml \ "PubmedData" \ "ArticleIdList" \ "ArticleId"
articleIds.foreach(articleId => {
val idType = (articleId \ "@IdType").text
val id = articleId.text
if ("doi".equalsIgnoreCase(idType)) article.setDoi(id)
if ("pmc".equalsIgnoreCase(idType)) article.setPmcId(id)
})
article.setPmid(pmId.text)
val pubMedPubDate = xml \ "MedlineCitation" \ "DateCompleted"
val currentDate =
validate_Date((pubMedPubDate \ "Year").text, (pubMedPubDate \ "Month").text, (pubMedPubDate \ "Day").text)
if (currentDate != null) article.setDate(currentDate)
val articleTitle = xml \ "MedlineCitation" \ "Article" \ "ArticleTitle"
article.setTitle(articleTitle.text)
val abstractText = xml \ "MedlineCitation" \ "Article" \ "Abstract" \ "AbstractText"
if (abstractText != null && abstractText.text != null && abstractText.text.nonEmpty)
article.setDescription(abstractText.text.split("\n").map(s => s.trim).mkString(" ").trim)
val language = xml \ "MedlineCitation" \ "Article" \ "Language"
article.setLanguage(language.text)
val subjects = xml \ "MedlineCitation" \ "MeshHeadingList" \ "MeshHeading"
article.setSubjects(
subjects
.take(20)
.map(subject => {
val descriptorName = (subject \ "DescriptorName").text
val ui = (subject \ "DescriptorName" \ "@UI").text
val s = new PMSubject
s.setValue(descriptorName)
s.setMeshId(ui)
s
})
.toList
.asJava
)
val publicationTypes = xml \ "MedlineCitation" \ "Article" \ "PublicationTypeList" \ "PublicationType"
article.setPublicationTypes(
publicationTypes
.map(pt => {
val s = new PMSubject
s.setValue(pt.text)
s
})
.toList
.asJava
)
article
}
def parse2(xml: XMLEventReader): PMArticle = {
var currentArticle: PMArticle = null
var currentSubject: PMSubject = null
var currentAuthor: PMAuthor = null
var currentJournal: PMJournal = null
var currentGrant: PMGrant = null
var currNode: String = null
var currentYear = "0"
var currentMonth = "01"
var currentDay = "01"
var currentArticleType: String = null
while (xml.hasNext) {
val ne = xml.next
ne match {
case EvElemStart(_, label, attrs, _) =>
currNode = label
label match {
case "PubmedArticle" => currentArticle = new PMArticle
case "Author" => currentAuthor = new PMAuthor
case "Journal" => currentJournal = new PMJournal
case "Grant" => currentGrant = new PMGrant
case "PublicationType" | "DescriptorName" =>
currentSubject = new PMSubject
currentSubject.setMeshId(extractAttributes(attrs, "UI"))
case "ArticleId" => currentArticleType = extractAttributes(attrs, "IdType")
case _ =>
}
case EvElemEnd(_, label) =>
label match {
case "PubmedArticle" => return currentArticle
case "Author" => currentArticle.getAuthors.add(currentAuthor)
case "Journal" => currentArticle.setJournal(currentJournal)
case "Grant" => currentArticle.getGrants.add(currentGrant)
case "PubMedPubDate" =>
if (currentArticle.getDate == null)
currentArticle.setDate(validate_Date(currentYear, currentMonth, currentDay))
case "PubDate" => currentJournal.setDate(s"$currentYear-$currentMonth-$currentDay")
case "DescriptorName" => currentArticle.getSubjects.add(currentSubject)
case "PublicationType" => currentArticle.getPublicationTypes.add(currentSubject)
case _ =>
}
case EvText(text) =>
if (currNode != null && text.trim.nonEmpty)
currNode match {
case "ArticleTitle" => {
if (currentArticle.getTitle == null)
currentArticle.setTitle(text.trim)
else
currentArticle.setTitle(currentArticle.getTitle + text.trim)
}
case "AbstractText" => {
if (currentArticle.getDescription == null)
currentArticle.setDescription(text.trim)
else
currentArticle.setDescription(currentArticle.getDescription + text.trim)
}
case "PMID" => currentArticle.setPmid(text.trim)
case "ArticleId" =>
if ("doi".equalsIgnoreCase(currentArticleType)) currentArticle.setDoi(text.trim)
if ("pmc".equalsIgnoreCase(currentArticleType)) currentArticle.setPmcId(text.trim)
case "Language" => currentArticle.setLanguage(text.trim)
case "ISSN" => currentJournal.setIssn(text.trim)
case "GrantID" => currentGrant.setGrantID(text.trim)
case "Agency" => currentGrant.setAgency(text.trim)
case "Country" => if (currentGrant != null) currentGrant.setCountry(text.trim)
case "Year" => currentYear = text.trim
case "Month" => currentMonth = text.trim
case "Day" => currentDay = text.trim
case "Volume" => currentJournal.setVolume(text.trim)
case "Issue" => currentJournal.setIssue(text.trim)
case "PublicationType" | "DescriptorName" => currentSubject.setValue(text.trim)
case "LastName" => {
if (currentAuthor != null)
currentAuthor.setLastName(text.trim)
}
case "ForeName" =>
if (currentAuthor != null)
currentAuthor.setForeName(text.trim)
case "Title" =>
if (currentJournal.getTitle == null)
currentJournal.setTitle(text.trim)
else
currentJournal.setTitle(currentJournal.getTitle + text.trim)
case _ =>
}
case _ =>
}
}
null
}
}

View File

@ -294,6 +294,24 @@ object PubMedToOaf {
author.setName(a.getForeName)
author.setSurname(a.getLastName)
author.setFullname(a.getFullName)
if (a.getIdentifier != null) {
author.setPid(
List(
OafMapperUtils.structuredProperty(
a.getIdentifier.getPid,
OafMapperUtils.qualifier(
a.getIdentifier.getType,
a.getIdentifier.getType,
ModelConstants.DNET_PID_TYPES,
ModelConstants.DNET_PID_TYPES
),
dataInfo
)
).asJava
)
}
if (a.getAffiliation != null)
author.setRawAffiliationString(List(a.getAffiliation.getName).asJava)
author.setRank(index + 1)
author
}(collection.breakOut)

View File

@ -0,0 +1,165 @@
package eu.dnetlib.dhp.actionmanager.raid;
import static java.nio.file.Files.createTempDirectory;
import static eu.dnetlib.dhp.actionmanager.Constants.OBJECT_MAPPER;
import static org.junit.jupiter.api.Assertions.assertEquals;
import java.io.File;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
import eu.dnetlib.dhp.actionmanager.opencitations.CreateOpenCitationsASTest;
import eu.dnetlib.dhp.actionmanager.raid.model.RAiDEntity;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.oaf.Oaf;
import eu.dnetlib.dhp.schema.oaf.OtherResearchProduct;
import eu.dnetlib.dhp.schema.oaf.Relation;
import scala.Tuple2;
public class GenerateRAiDActionSetJobTest {
private static String input_path;
private static String output_path;
static SparkSession spark;
@BeforeEach
void setUp() throws Exception {
input_path = Paths
.get(
GenerateRAiDActionSetJobTest.class
.getResource("/eu/dnetlib/dhp/actionmanager/raid/raid_example.json")
.toURI())
.toFile()
.getAbsolutePath();
output_path = createTempDirectory(GenerateRAiDActionSetJobTest.class.getSimpleName() + "-")
.toAbsolutePath()
.toString();
SparkConf conf = new SparkConf();
conf.setAppName(GenerateRAiDActionSetJobTest.class.getSimpleName());
conf.setMaster("local[*]");
conf.set("spark.driver.host", "localhost");
conf.set("hive.metastore.local", "true");
conf.set("spark.ui.enabled", "false");
conf.set("spark.sql.warehouse.dir", output_path);
conf.set("hive.metastore.warehouse.dir", output_path);
spark = SparkSession
.builder()
.appName(GenerateRAiDActionSetJobTest.class.getSimpleName())
.config(conf)
.getOrCreate();
}
@AfterAll
static void cleanUp() throws Exception {
FileUtils.deleteDirectory(new File(output_path));
}
@Test
@Disabled
void testProcessRAiDEntities() {
GenerateRAiDActionSetJob.processRAiDEntities(spark, input_path, output_path + "/test_raid_action_set");
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<? extends Oaf> result = sc
.sequenceFile(output_path + "/test_raid_action_set", Text.class, Text.class)
.map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
.map(AtomicAction::getPayload);
assertEquals(80, result.count());
}
@Test
void testPrepareRAiD() {
List<AtomicAction<? extends Oaf>> atomicActions = GenerateRAiDActionSetJob
.prepareRAiD(
new RAiDEntity(
"-92190526",
Arrays
.asList(
"Berli, Justin", "Le Mao, Bérénice", "Guillaume Touya", "Wenclik, Laura",
"Courtial, Azelle", "Muehlenhaus, Ian", "Justin Berli", "Touya, Guillaume",
"Gruget, Maïeul", "Azelle Courtial", "Ian Muhlenhaus", "Maïeul Gruget", "Marion Dumont",
"Maïeul GRUGET", "Cécile Duchêne"),
"2021-09-10",
"2024-02-16",
Arrays
.asList(
"cartography, zoom, pan, desert fog", "Road network", "zooming", "Pan-scalar maps",
"pan-scalar map", "Python library", "QGIS", "map design", "landmarks",
"Cartes transscalaires", "anchor", "disorientation", "[INFO]Computer Science [cs]",
"[SHS.GEO]Humanities and Social Sciences/Geography", "cognitive cartography",
"eye-tracking", "Computers in Earth Sciences", "Topographic map", "National Mapping Agency",
"General Medicine", "Geography, Planning and Development", "multi-scales",
"pan-scalar maps", "Selection", "cartography", "General Earth and Planetary Sciences",
"progressiveness", "map generalisation", "Eye-tracker", "zoom", "algorithms", "Map Design",
"cartography, map generalisation, zoom, multi-scale map", "Interactive maps",
"Map generalisation", "Earth and Planetary Sciences (miscellaneous)",
"Cartographic generalization", "rivers", "Benchmark", "General Environmental Science",
"open source", "drawing", "Constraint", "Multi-scale maps"),
Arrays
.asList(
"Where do people look at during multi-scale map tasks?", "FogDetector survey raw data",
"Collection of cartographic disorientation stories", "Anchorwhat dataset",
"BasqueRoads: A Benchmark for Road Network Selection",
"Progressive river network selection for pan-scalar maps",
"BasqueRoads, a dataset to benchmark road selection algorithms",
"Missing the city for buildings? A critical review of pan-scalar map generalization and design in contemporary zoomable maps",
"Empirical approach to advance the generalisation of multi-scale maps",
"L'Alpe d'Huez: a dataset to benchmark topographic map generalisation",
"eye-tracking data from a survey on zooming in a pan-scalar map",
"Material of the experiment 'More is Less' from the MapMuxing project",
"Cartagen4py, an open source Python library for map generalisation",
"LAlpe dHuez: A Benchmark for Topographic Map Generalisation"),
Arrays
.asList(
"50|doi_dedup___::6915135e0aa39f913394513f809ae58a",
"50|doi_dedup___::754e3c283639bc6e104c925ff3e34007",
"50|doi_dedup___::13517477f3c1261d57a3364363ce6ce0",
"50|doi_dedup___::675b16c73accc4e7242bbb4ed9b3724a",
"50|doi_dedup___::94ce09906b2d7d37eb2206cea8a50153",
"50|dedup_wf_002::cc575d5ca5651ff8c3029a3a76e7e70a",
"50|doi_dedup___::c5e52baddda17c755d1bae012a97dc13",
"50|doi_dedup___::4f5f38c9e08fe995f7278963183f8ad4",
"50|doi_dedup___::a9bc4453273b2d02648a5cb453195042",
"50|doi_dedup___::5e893dc0cb7624a33f41c9b428bd59f7",
"50|doi_dedup___::c1ecdef48fd9be811a291deed950e1c5",
"50|doi_dedup___::9e93c8f2d97c35de8a6a57a5b53ef283",
"50|dedup_wf_002::d08be0ed27b13d8a880e891e08d093ea",
"50|doi_dedup___::f8d8b3b9eddeca2fc0e3bc9e63996555"),
"Exploring Multi-Scale Map Generalization and Design",
"This project aims to advance the generalization of multi-scale maps by investigating the impact of different design elements on user experience. The research involves collecting and analyzing data from various sources, including surveys, eye-tracking studies, and user experiments. The goal is to identify best practices for map generalization and design, with a focus on reducing disorientation and improving information retrieval during exploration. The project has led to the development of several datasets, including BasqueRoads, AnchorWhat, and L'Alpe d'Huez, which can be used to benchmark road selection algorithms and topographic map generalization techniques. The research has also resulted in the creation of a Python library, Cartagen4py, for map generalization. The findings of this project have the potential to improve the design and usability of multi-scale maps, making them more effective tools for navigation and information retrieval."));
OtherResearchProduct orp = (OtherResearchProduct) atomicActions.get(0).getPayload();
Relation rel = (Relation) atomicActions.get(1).getPayload();
assertEquals("Exploring Multi-Scale Map Generalization and Design", orp.getTitle().get(0).getValue());
assertEquals("50|raid________::759a564ce5cc7360cab030c517c7366b", rel.getSource());
assertEquals("50|doi_dedup___::6915135e0aa39f913394513f809ae58a", rel.getTarget());
}
}

View File

@ -13,7 +13,7 @@ import eu.dnetlib.dhp.common.collection.HttpClientParams;
class Gtr2PublicationsIteratorTest {
private static final String baseURL = "https://gtr.ukri.org/gtr/api";
private static final String baseURL = "https://gtr.ukri.org/api";
private static final HttpClientParams clientParams = new HttpClientParams();
@ -34,7 +34,7 @@ class Gtr2PublicationsIteratorTest {
@Test
@Disabled
public void testPaging() throws Exception {
final Iterator<String> iterator = new Gtr2PublicationsIterator(baseURL, null, "2", "2", clientParams);
final Iterator<String> iterator = new Gtr2PublicationsIterator(baseURL, null, "2", "3", clientParams);
while (iterator.hasNext()) {
Thread.sleep(300);
@ -47,9 +47,9 @@ class Gtr2PublicationsIteratorTest {
@Test
@Disabled
public void testOnePage() throws Exception {
final Iterator<String> iterator = new Gtr2PublicationsIterator(baseURL, null, "12", "12", clientParams);
final Iterator<String> iterator = new Gtr2PublicationsIterator(baseURL, null, "379", "380", clientParams);
final int count = iterateAndCount(iterator);
assertEquals(20, count);
assertEquals(50, count);
}
@Test

View File

@ -0,0 +1,28 @@
package eu.dnetlib.dhp.collection.plugin.zenodo;
import java.util.zip.GZIPInputStream;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
public class ZenodoPluginCollectionTest {
@Test
public void testZenodoIterator() throws Exception {
final GZIPInputStream gis = new GZIPInputStream(
getClass().getResourceAsStream("/eu/dnetlib/dhp/collection/zenodo/zenodo.tar.gz"));
try (ZenodoTarIterator it = new ZenodoTarIterator(gis)) {
Assertions.assertTrue(it.hasNext());
int i = 0;
while (it.hasNext()) {
Assertions.assertNotNull(it.next());
i++;
}
Assertions.assertEquals(10, i);
}
}
}

View File

@ -0,0 +1,6 @@
{"raid": "-9222092103004099540", "authors": ["Department of Archaeology & Museums", "Department of Archaeology and Museums", "Department Of Archaeology & Museums"], "subjects": ["Begamganj", "Raisen", "Bhopal", "Budhni", "Malwa site survey", "सीहोर", "Gauharganj", "बुधनी", "Budni", "Berasia"], "titles": ["Malwa site survey : Raisen District, Begamganj Tahsīl, photographic documentation", "Malwa site survey : Bhopal District, photographic documentation (version 1, TIFF files)", "Malwa site survey : Raisen District, Gauharganj Tahsīl, village finds", "Malwa site survey : Sehore सीहोर District, Budni Tahsīl, photographic documentation (part 1)", "Malwa site survey: Bhopal District, Berasia Tahsīl, photographic documentation (with villages named)", "Malwa site survey : Sehore सीहोर District, Budni Tahsīl, photographic documentation (part 2)", "Malwa site survey : Bhopal District, photographic documentation (version 2, JPEG files)"], "ids": ["50|doi_dedup___::7523d165970830dd857e6cbea4302adf", "50|doi_dedup___::02309ae8a9fae291df321e317f5c5330", "50|doi_dedup___::95347ba2c4264414fab39712ee7fe481", "50|doi_dedup___::970aa708fe667596754fd02a708780f5", "50|doi_dedup___::b7cd9128cc53b1257a4f000347f339b0", "50|doi_dedup___::c7d65da0ecedef4d2c702b9db197d90c", "50|doi_dedup___::addbb67cf5046e340f342ba091bcebfa"], "title": "Documentation of Malwa Region", "summary": "This project involves the documentation of the Malwa region through photographic surveys. The surveys were conducted by the Department of Archaeology and Museums, Madhya Pradesh, and cover various districts and tahsils. The documentation includes photographic records of sites, villages, and other relevant features. The project aims to provide a comprehensive understanding of the region's cultural and historical significance.", "startDate": "2019-03-06", "endDate": "2019-03-08"}
{"raid": "-9221424331076109424", "authors": ["Hutchings, Judy", "Ward, Catherine", "Baban, Adriana", "D<><44>nil<69><6C>, Ingrid", "Frantz, Inga", "Gardner, Frances", "Lachman, Jamie", "Lachman, Jamie M.", "Foran, Heather", "Heinrichs, Nina", "Murphy, Hugh", "B<><42>ban, Adriana", "Raleva, Marija", "Fang, Xiangming", "Jansen, Elena", "Taut, Diana", "Foran, Heather M.", "T<><54>ut, Diana", "Ward, Catherine L.", "Williams, Margiad", "Lesco, Galina", "Brühl, Antonia"], "subjects": ["3. Good health", "5. Gender equality", "Criminology not elsewhere classified", "1. No poverty", "2. Zero hunger"], "titles": ["sj-docx-1-vaw-10.1177_10778012231188090 - Supplemental material for Co-Occurrence of Intimate Partner Violence Against Mothers and Maltreatment of Their Children With Behavioral Problems in Eastern Europe", "Hunger in vulnerable families in Southeastern Europe: Associations with health and violence", "Prevention of child mental health problems through parenting interventions in Southeastern Europe (RISE): study protocol for a multi-site randomised controlled trial"], "ids": ["50|doi_dedup___::a70015063e5400dae2e097ee10b4a589", "50|doi_dedup___::6e1d12026fcde9087724622ccdeed430", "50|doi_dedup___::5b7bd5d46c5d95e2ef5b36663504a67e"], "title": "Exploring the Impact of Hunger and Violence on Child Health in Southeastern Europe", "summary": "This study aims to investigate the relationship between hunger, violence, and child health in vulnerable families in Southeastern Europe. The research will explore the experiences of families in FYR Macedonia, Republic of Moldova, and Romania, and examine the associations between hunger, maltreatment, and other health indicators. The study will also test the efficacy of a parenting intervention targeting child behavioral problems in alleviating these issues. The findings of this research will contribute to the development of effective interventions to address the complex needs of vulnerable families in the region.", "startDate": "2019-06-04", "endDate": "2023-01-01"}
{"raid": "-9219052635741785098", "authors": ["Berli, Justin", "Le Mao, Bérénice", "Guillaume Touya", "Wenclik, Laura", "Courtial, Azelle", "Muehlenhaus, Ian", "Justin Berli", "Touya, Guillaume", "Gruget, Maïeul", "Azelle Courtial", "Ian Muhlenhaus", "Maïeul Gruget", "Marion Dumont", "Maïeul GRUGET", "Cécile Duchêne"], "subjects": ["cartography, zoom, pan, desert fog", "Road network", "zooming", "Pan-scalar maps", "pan-scalar map", "Python library", "QGIS", "map design", "landmarks", "Cartes transscalaires", "anchor", "disorientation", "[INFO]Computer Science [cs]", "[SHS.GEO]Humanities and Social Sciences/Geography", "cognitive cartography", "eye-tracking", "Computers in Earth Sciences", "Topographic map", "National Mapping Agency", "General Medicine", "Geography, Planning and Development", "multi-scales", "pan-scalar maps", "Selection", "cartography", "General Earth and Planetary Sciences", "progressiveness", "map generalisation", "Eye-tracker", "zoom", "algorithms", "Map Design", "cartography, map generalisation, zoom, multi-scale map", "Interactive maps", "Map generalisation", "Earth and Planetary Sciences (miscellaneous)", "Cartographic generalization", "rivers", "Benchmark", "General Environmental Science", "open source", "drawing", "Constraint", "Multi-scale maps"], "titles": ["Where do people look at during multi-scale map tasks?", "FogDetector survey raw data", "Collection of cartographic disorientation stories", "Anchorwhat dataset", "BasqueRoads: A Benchmark for Road Network Selection", "Progressive river network selection for pan-scalar maps", "BasqueRoads, a dataset to benchmark road selection algorithms", "Missing the city for buildings? A critical review of pan-scalar map generalization and design in contemporary zoomable maps", "Empirical approach to advance the generalisation of multi-scale maps", "L'Alpe d'Huez: a dataset to benchmark topographic map generalisation", "eye-tracking data from a survey on zooming in a pan-scalar map", "Material of the experiment \"More is Less\" from the MapMuxing project", "Cartagen4py, an open source Python library for map generalisation", "LAlpe dHuez: A Benchmark for Topographic Map Generalisation"], "ids": ["50|doi_dedup___::6915135e0aa39f913394513f809ae58a", "50|doi_dedup___::754e3c283639bc6e104c925ff3e34007", "50|doi_dedup___::13517477f3c1261d57a3364363ce6ce0", "50|doi_dedup___::675b16c73accc4e7242bbb4ed9b3724a", "50|doi_dedup___::94ce09906b2d7d37eb2206cea8a50153", "50|dedup_wf_002::cc575d5ca5651ff8c3029a3a76e7e70a", "50|doi_dedup___::c5e52baddda17c755d1bae012a97dc13", "50|doi_dedup___::4f5f38c9e08fe995f7278963183f8ad4", "50|doi_dedup___::a9bc4453273b2d02648a5cb453195042", "50|doi_dedup___::5e893dc0cb7624a33f41c9b428bd59f7", "50|doi_dedup___::c1ecdef48fd9be811a291deed950e1c5", "50|doi_dedup___::9e93c8f2d97c35de8a6a57a5b53ef283", "50|dedup_wf_002::d08be0ed27b13d8a880e891e08d093ea", "50|doi_dedup___::f8d8b3b9eddeca2fc0e3bc9e63996555"], "title": "Exploring Multi-Scale Map Generalization and Design", "summary": "This project aims to advance the generalization of multi-scale maps by investigating the impact of different design elements on user experience. The research involves collecting and analyzing data from various sources, including surveys, eye-tracking studies, and user experiments. The goal is to identify best practices for map generalization and design, with a focus on reducing disorientation and improving information retrieval during exploration. The project has led to the development of several datasets, including BasqueRoads, AnchorWhat, and L'Alpe d'Huez, which can be used to benchmark road selection algorithms and topographic map generalization techniques. The research has also resulted in the creation of a Python library, Cartagen4py, for map generalization. The findings of this project have the potential to improve the design and usability of multi-scale maps, making them more effective tools for navigation and information retrieval.", "startDate": "2021-09-10", "endDate": "2024-02-16"}
{"raid": "-9216828847055450272", "authors": ["Grey, Alan", "Gorelov, Sergey", "Pall, Szilard", "Merz, Pascal", "Justin A., Lemkul", "Szilárd Páll", "Pasquadibisceglie, Andrea", "Kutzner, Carsten", "Schulz, Roland", "Nabet, Julien", "Abraham, Mark", "Jalalypour, Farzaneh", "Lundborg, Magnus", "Gray, Alan", "Villa, Alessandra", "Berk Hess", "Santuz, Hubert", "Irrgang, M. Eric", "Wingbermuehle, Sebastian", "Lemkul, Justin A.", "Jordan, Joe", "Pellegrino, Michele", "Doijade, Mahesh", "Shvetsov, Alexey", "Hess, Berk", "Behera, Sudarshan", "Andrey Alekseenko", "Shugaeva, Tatiana", "Fleischmann, Stefan", "Bergh, Cathrine", "Morozov, Dmitry", "Adam Hospital", "Briand, Eliane", "Lindahl, Erik", "Brown, Ania", "Marta Lloret Llinares", "Miletic, Vedran", "Alekseenko, Andrey", "Gouaillardet, Gilles", "Fiorin, Giacomo", "Basov, Vladimir"], "subjects": ["webinar"], "titles": ["Introduction to HPC: molecular dynamics simulations with GROMACS: log files", "BioExcel webinar #73: Competency frameworks to support training design and professional development", "Introduction to HPC: molecular dynamics simulations with GROMACS: output files - Devana", "GROMACS 2024.0 Manual", "BioExcel Webinar #71: GROMACS-PMX for accurate estimation of free energy differences", "Introduction to HPC: molecular dynamics simulations with GROMACS: input files", "BioExcel Webinar #68: What's new in GROMACS 2023", "BioExcel Webinar #69: BioBB-Wfs and BioBB-API, integrated web-based platform and programmatic interface for biomolecular simulations workflows using the BioExcel Building Blocks library", "GROMACS 2024-beta Source code"], "ids": ["50|doi_dedup___::8318fbc815ee1943c3269be7567f220b", "50|doi_dedup___::9530e03fb2aac63e82b18a40dc09e32c", "50|doi_dedup___::30174ab31075e76a428ca5b4f4d236b8", "50|doi_________::70b7c6dce09ae6f1361d22913fdf95eb", "50|doi_dedup___::337dd48600618f3c06257edd750d6201", "50|doi_dedup___::d622992ba9077617f37ebd268b3e806d", "50|doi_dedup___::0b0bcc6825d6c052c37882fd5cfc1e8c", "50|doi_dedup___::4b1541a7cee32527c65ace5d1ed57335", "50|doi_dedup___::1379861df59bd755e4fb39b9f95ffbd3"], "title": "Exploring High-Performance Computing and Biomolecular Simulations", "summary": "This project involves exploring high-performance computing (HPC) and biomolecular simulations using GROMACS. The objectives include understanding molecular dynamics simulations, log files, input files, and output files. Additionally, the project aims to explore competency frameworks for professional development, specifically in the field of computational biomolecular research. The tools and techniques used will include GROMACS, BioExcel Building Blocks, and competency frameworks. The expected outcomes include a deeper understanding of HPC and biomolecular simulations, as well as the development of skills in using GROMACS and BioExcel Building Blocks. The project will also contribute to the development of competency frameworks for professional development in the field of computational biomolecular research.", "startDate": "2023-04-25", "endDate": "2024-01-30"}
{"raid": "-9210544816395499758", "authors": ["Bateson, Melissa", "Andrews, Clare", "Verhulst, Simon", "Nettle, Daniel", "Zuidersma, Erica"], "subjects": ["2. Zero hunger"], "titles": ["Exposure to food insecurity increases energy storage and reduces somatic maintenance in European starlings", "Data and code archive for Andrews et al. 'Exposure to food insecurity increases energy storage and reduces somatic maintenance in European starlings'"], "ids": ["50|doi_dedup___::176117239be06189523c253e0ca9c5ec", "50|doi_dedup___::343e0b0ddf0d54763a89a62af1f7a379"], "title": "Investigating the Effects of Food Insecurity on Energy Storage and Somatic Maintenance in European Starlings", "summary": "This study examines the impact of food insecurity on energy storage and somatic maintenance in European starlings. The research involved exposing juvenile starlings to either uninterrupted food availability or a regime of unpredictable food unavailability. The results show that birds exposed to food insecurity stored more energy, but at the expense of somatic maintenance and repair. The study provides insights into the adaptive responses of birds to food scarcity and the trade-offs involved in energy storage and maintenance.", "startDate": "2021-06-28", "endDate": "2021-06-28"}
{"raid": "-9208499171224730388", "authors": ["Maniati, Eleni", "Bakker, Bjorn", "McClelland, Sarah E.", "Shaikh, Nadeem", "De Angelis, Simone", "Johnson, Sarah C.", "Wang, Jun", "Foijer, Floris", "Spierings, Diana C. J.", "Boemo, Michael A.", "Wardenaar, René", "Mazzagatti, Alice"], "subjects": [], "titles": ["Additional file 2 of Replication stress generates distinctive landscapes of DNA copy number alterations and chromosome scale losses", "Additional file 5 of Replication stress generates distinctive landscapes of DNA copy number alterations and chromosome scale losses"], "ids": ["50|doi_dedup___::a1bfeb173971f74a274fab8bdd78a4bc", "50|doi_dedup___::3d6e151aaeb2f7c40a320207fdd80ade"], "title": "Analysis of DNA Copy Number Alterations and Chromosome Scale Losses", "summary": "This study analyzed the effects of replication stress on DNA copy number alterations and chromosome scale losses. The results show distinctive landscapes of these alterations and losses, which were further investigated in additional files. The study provides valuable insights into the mechanisms of replication stress and its impact on genomic stability.", "startDate": "2022-01-01", "endDate": "2022-01-01"}

View File

@ -0,0 +1,232 @@
{
"indexed": {
"date-parts": [
[
2022,
4,
3
]
],
"date-time": "2022-04-03T01:45:59Z",
"timestamp": 1648950359167
},
"reference-count": 0,
"publisher": "American Society of Clinical Oncology (ASCO)",
"issue": "18_suppl",
"content-domain": {
"domain": [],
"crossmark-restriction": false
},
"short-container-title": [
"JCO"
],
"published-print": {
"date-parts": [
[
2007,
6,
20
]
]
},
"abstract": "<jats:p> 3507 </jats:p><jats:p> Purpose: To detect IGF-1R on circulating tumor cells (CTCs) as a biomarker in the clinical development of a monoclonal human antibody, CP-751,871, targeting IGF-1R. Experimental Design: An automated sample preparation and analysis system for enumerating CTCs (Celltracks) was adapted for detecting IGF-1R positive CTCs with a diagnostic antibody targeting a different IGF-1R epitope to CP-751,871. This assay was utilized in three phase I trials of CP-751,871 as a single agent or with chemotherapy and was validated using cell lines and blood samples from healthy volunteers and patients with metastatic carcinoma. Results: There was no interference between the analytical and therapeutic antibodies. CP-751,871 was well tolerated as a single agent, and in combination with docetaxel or carboplatin and paclitaxel, at doses ranging from 0.05 mg/kg to 20 mg/kg. Eighty patients were enrolled on phase 1 studies of CP-751,871, with 47 (59%) patients having CTCs detected during the study. Prior to treatment 26 patients (33%) had CTCs, with 23 having detectable IGF-1R positive CTCs. CP-751,871 alone, and CP-751,871 with cytotoxic chemotherapy, decreased CTCs and IGF-1R positive CTCs; these increased towards the end of the 21-day cycle in some patients, falling again with retreatment. CTCs were commonest in advanced hormone refractory prostate cancer (11/20). Detectable IGF-1R expression on CTCs before treatment with CP-751,871 and docetaxel was associated with a higher frequency of PSA decline by more than 50% (6/10 vs 2/8 patients). A relationship was observed between sustained falls in CTCs counts and PSA declines by more than 50%. Conclusions: IGF-1R expression is detectable by immunofluorescence on CTCs. These data support the further evaluation of CTCs in pharmacodynamic studies and patient selection, particularly in advanced prostate cancer. </jats:p><jats:p> No significant financial relationships to disclose. </jats:p>",
"DOI": "10.1200/jco.2007.25.18_suppl.3507",
"type": "journal-article",
"created": {
"date-parts": [
[
2020,
3,
6
]
],
"date-time": "2020-03-06T20:50:42Z",
"timestamp": 1583527842000
},
"page": "3507-3507",
"source": "Crossref",
"is-referenced-by-count": 0,
"title": [
"Circulating tumor cells expressing the insulin growth factor-1 receptor (IGF-1R): Method of detection, incidence and potential applications"
],
"prefix": "10.1200",
"volume": "25",
"author": [
{
"given": "J. S.",
"family": "de Bono",
"sequence": "first",
"affiliation": [
{
"name": "Royal Marsden Hospital, Surrey, United Kingdom; Mayo Clinic, Rochester, MN; McGill University & Lady Davis Research Institute, Montreal, PQ, Canada; Pfizer Global Research & Development, New London, CT; Immunicon Corporation, Huntingdon Valley, PA"
}
]
},
{
"given": "A.",
"family": "Adjei",
"sequence": "additional",
"affiliation": [
{
"name": "Royal Marsden Hospital, Surrey, United Kingdom; Mayo Clinic, Rochester, MN; McGill University & Lady Davis Research Institute, Montreal, PQ, Canada; Pfizer Global Research & Development, New London, CT; Immunicon Corporation, Huntingdon Valley, PA"
}
]
},
{
"given": "G.",
"family": "Attard",
"sequence": "additional",
"affiliation": [
{
"name": "Royal Marsden Hospital, Surrey, United Kingdom; Mayo Clinic, Rochester, MN; McGill University & Lady Davis Research Institute, Montreal, PQ, Canada; Pfizer Global Research & Development, New London, CT; Immunicon Corporation, Huntingdon Valley, PA"
}
]
},
{
"given": "M.",
"family": "Pollak",
"sequence": "additional",
"affiliation": [
{
"name": "Royal Marsden Hospital, Surrey, United Kingdom; Mayo Clinic, Rochester, MN; McGill University & Lady Davis Research Institute, Montreal, PQ, Canada; Pfizer Global Research & Development, New London, CT; Immunicon Corporation, Huntingdon Valley, PA"
}
]
},
{
"given": "P.",
"family": "Fong",
"sequence": "additional",
"affiliation": [
{
"name": "Royal Marsden Hospital, Surrey, United Kingdom; Mayo Clinic, Rochester, MN; McGill University & Lady Davis Research Institute, Montreal, PQ, Canada; Pfizer Global Research & Development, New London, CT; Immunicon Corporation, Huntingdon Valley, PA"
}
]
},
{
"given": "P.",
"family": "Haluska",
"sequence": "additional",
"affiliation": [
{
"name": "Royal Marsden Hospital, Surrey, United Kingdom; Mayo Clinic, Rochester, MN; McGill University & Lady Davis Research Institute, Montreal, PQ, Canada; Pfizer Global Research & Development, New London, CT; Immunicon Corporation, Huntingdon Valley, PA"
}
]
},
{
"given": "L.",
"family": "Roberts",
"sequence": "additional",
"affiliation": [
{
"name": "Royal Marsden Hospital, Surrey, United Kingdom; Mayo Clinic, Rochester, MN; McGill University & Lady Davis Research Institute, Montreal, PQ, Canada; Pfizer Global Research & Development, New London, CT; Immunicon Corporation, Huntingdon Valley, PA"
}
]
},
{
"given": "D.",
"family": "Chainese",
"sequence": "additional",
"affiliation": [
{
"name": "Royal Marsden Hospital, Surrey, United Kingdom; Mayo Clinic, Rochester, MN; McGill University & Lady Davis Research Institute, Montreal, PQ, Canada; Pfizer Global Research & Development, New London, CT; Immunicon Corporation, Huntingdon Valley, PA"
}
]
},
{
"given": "L.",
"family": "Terstappen",
"sequence": "additional",
"affiliation": [
{
"name": "Royal Marsden Hospital, Surrey, United Kingdom; Mayo Clinic, Rochester, MN; McGill University & Lady Davis Research Institute, Montreal, PQ, Canada; Pfizer Global Research & Development, New London, CT; Immunicon Corporation, Huntingdon Valley, PA"
}
]
},
{
"given": "A.",
"family": "Gualberto",
"sequence": "additional",
"affiliation": [
{
"name": "Royal Marsden Hospital, Surrey, United Kingdom; Mayo Clinic, Rochester, MN; McGill University & Lady Davis Research Institute, Montreal, PQ, Canada; Pfizer Global Research & Development, New London, CT; Immunicon Corporation, Huntingdon Valley, PA"
}
]
}
],
"member": "233",
"container-title": [
"Journal of Clinical Oncology"
],
"original-title": [],
"language": "en",
"deposited": {
"date-parts": [
[
2020,
3,
6
]
],
"date-time": "2020-03-06T20:51:03Z",
"timestamp": 1583527863000
},
"score": 1,
"resource": {
"primary": {
"URL": "http://ascopubs.org/doi/10.1200/jco.2007.25.18_suppl.3507"
}
},
"subtitle": [],
"short-title": [],
"issued": {
"date-parts": [
[
2007,
6,
20
]
]
},
"references-count": 0,
"journal-issue": {
"issue": "18_suppl",
"published-print": {
"date-parts": [
[
2007,
6,
20
]
]
}
},
"alternative-id": [
"10.1200/jco.2007.25.18_suppl.3507"
],
"URL": "http://dx.doi.org/10.1200/jco.2007.25.18_suppl.3507",
"relation": {},
"ISSN": [
"0732-183X",
"1527-7755"
],
"issn-type": [
{
"value": "0732-183X",
"type": "print"
},
{
"value": "1527-7755",
"type": "electronic"
}
],
"subject": [],
"published": {
"date-parts": [
[
2007,
6,
20
]
]
}
}

View File

@ -0,0 +1,157 @@
<PubmedArticle>
<MedlineCitation Status="MEDLINE" IndexingMethod="Curated" Owner="NLM">
<PMID Version="1">37318999</PMID>
<DateCompleted>
<Year>2024</Year>
<Month>02</Month>
<Day>09</Day>
</DateCompleted>
<DateRevised>
<Year>2024</Year>
<Month>02</Month>
<Day>09</Day>
</DateRevised>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Electronic">1522-1229</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>47</Volume>
<Issue>3</Issue>
<PubDate>
<Year>2023</Year>
<Month>Sep</Month>
<Day>01</Day>
</PubDate>
</JournalIssue>
<Title>Advances in physiology education</Title>
<ISOAbbreviation>Adv Physiol Educ</ISOAbbreviation>
</Journal>
<ArticleTitle>Providing the choice of in-person or videoconference attendance in a clinical physiology course may harm learning outcomes for the entire cohort.</ArticleTitle>
<Pagination>
<MedlinePgn>548-556</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1152/advan.00160.2022</ELocationID>
<Abstract>
<AbstractText>Clinical Physiology 1 and 2 are flipped classes in which students watch prerecorded videos before class. During the 3-h class, students take practice assessments, work in groups on critical thinking exercises, work through case studies, and engage in drawing exercises. Due to the COVID pandemic, these courses were transitioned from in-person classes to online classes. Despite the university's return-to-class policy, some students were reluctant to return to in-person classes; therefore during the 2021-2022 academic year, Clinical Physiology 1 and 2 were offered as flipped, hybrid courses. In a hybrid format, students either attended the synchronous class in person or online. Here we evaluate the learning outcomes and the perceptions of the learning experience for students who attended Clinical Physiology 1 and 2 either online (2020-2021) or in a hybrid format (2021-2022). In addition to exam scores, in-class surveys and end of course evaluations were compiled to describe the student experience in the flipped hybrid setting. Retrospective linear mixed-model regression analysis of exam scores revealed that a hybrid modality (2021-2022) was associated with lower exam scores when controlling for sex, graduate/undergraduate status, delivery method, and the order in which the courses were taken (<i>F</i> test: <i>F</i> = 8.65, df1 = 2, df2 = 179.28, <i>P</i> = 0.0003). In addition, being a Black Indigenous Person of Color (BIPOC) student is associated with a lower exam score, controlling for the same previous factors (<i>F</i> test: <i>F</i> = 4.23, df1 = 1, df2 = 130.28, <i>P</i> = 0.04), albeit with lower confidence; the BIPOC representation in this sample is small (BIPOC: <i>n</i> = 144; total: <i>n</i> = 504). There is no significant interaction between the hybrid modality and race, meaning that BIPOC and White students are both negatively affected in a hybrid flipped course. Instructors should consider carefully about offering hybrid courses and build in extra student support.<b>NEW &amp; NOTEWORTHY</b> The transition from online to in-person teaching has been as challenging as the original transition to remote teaching with the onset of the pandemic. Since not all students were ready to return to the classroom, students could choose to take this course in person or online. This arrangement provided flexibility and opportunities for innovative class activities for students but introduced tradeoffs in lower test scores from the hybrid modality than fully online or fully in-person modalities.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Anderson</LastName>
<ForeName>Lisa Carney</ForeName>
<Initials>LC</Initials>
<Identifier Source="ORCID">0000-0003-2261-1921</Identifier>
<AffiliationInfo>
<Affiliation>Department of Integrative Biology and Physiology, University of Minnesota, Minneapolis, Minnesota, United States.</Affiliation>
<Identifier Source="ROR">https://ror.org/017zqws13</Identifier>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Jacobson</LastName>
<ForeName>Tate</ForeName>
<Initials>T</Initials>
<AffiliationInfo>
<Affiliation>Department of Statistics, University of Minnesota, Minneapolis, Minnesota, United States.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2023</Year>
<Month>06</Month>
<Day>15</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>United States</Country>
<MedlineTA>Adv Physiol Educ</MedlineTA>
<NlmUniqueID>100913944</NlmUniqueID>
<ISSNLinking>1043-4046</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D010827" MajorTopicYN="Y">Physiology</DescriptorName>
<QualifierName UI="Q000193" MajorTopicYN="N">education</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012189" MajorTopicYN="N">Retrospective Studies</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D007858" MajorTopicYN="N">Learning</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D058873" MajorTopicYN="N">Pandemics</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000086382" MajorTopicYN="N">COVID-19</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012044" MajorTopicYN="N">Regression Analysis</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D013334" MajorTopicYN="N">Students</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D008297" MajorTopicYN="N">Male</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D005260" MajorTopicYN="N">Female</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D044465" MajorTopicYN="N">White People</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D044383" MajorTopicYN="N">Black People</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D020375" MajorTopicYN="N">Education, Distance</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D003479" MajorTopicYN="N">Curriculum</DescriptorName>
</MeshHeading>
</MeshHeadingList>
<KeywordList Owner="NOTNLM">
<Keyword MajorTopicYN="N">flipped teaching</Keyword>
<Keyword MajorTopicYN="N">hybrid teaching</Keyword>
<Keyword MajorTopicYN="N">inequity</Keyword>
<Keyword MajorTopicYN="N">learning outcomes</Keyword>
<Keyword MajorTopicYN="N">responsive teaching</Keyword>
</KeywordList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="medline">
<Year>2023</Year>
<Month>7</Month>
<Day>21</Day>
<Hour>6</Hour>
<Minute>44</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2023</Year>
<Month>6</Month>
<Day>15</Day>
<Hour>19</Hour>
<Minute>14</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2023</Year>
<Month>6</Month>
<Day>15</Day>
<Hour>12</Hour>
<Minute>53</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">37318999</ArticleId>
<ArticleId IdType="doi">10.1152/advan.00160.2022</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>

View File

@ -3,12 +3,15 @@ package eu.dnetlib.dhp.collection.crossref
import com.fasterxml.jackson.databind.ObjectMapper
import eu.dnetlib.dhp.aggregation.AbstractVocabularyTest
import eu.dnetlib.dhp.collection.crossref.Crossref2Oaf.TransformationType
import eu.dnetlib.dhp.schema.oaf.Publication
import org.apache.commons.io.IOUtils
import org.junit.jupiter.api.{BeforeEach, Test}
import org.junit.jupiter.api.{Assertions, BeforeEach, Test}
import org.junit.jupiter.api.extension.ExtendWith
import org.mockito.junit.jupiter.MockitoExtension
import org.slf4j.{Logger, LoggerFactory}
import scala.collection.JavaConverters.asScalaBufferConverter
@ExtendWith(Array(classOf[MockitoExtension]))
class CrossrefMappingTest extends AbstractVocabularyTest {
@ -25,8 +28,32 @@ class CrossrefMappingTest extends AbstractVocabularyTest {
val input =
IOUtils.toString(getClass.getResourceAsStream("/eu/dnetlib/dhp/collection/crossref/issn_pub.json"), "utf-8")
println(Crossref2Oaf.convert(input, vocabularies, TransformationType.All))
Crossref2Oaf
.convert(input, vocabularies, TransformationType.All)
.foreach(record => {
Assertions.assertNotNull(record)
})
}
@Test
def mappingAffiliation(): Unit = {
val input =
IOUtils.toString(
getClass.getResourceAsStream("/eu/dnetlib/dhp/collection/crossref/affiliationTest.json"),
"utf-8"
)
val data = Crossref2Oaf.convert(input, vocabularies, TransformationType.OnlyResult)
data.foreach(record => {
Assertions.assertNotNull(record)
Assertions.assertTrue(record.isInstanceOf[Publication])
val publication = record.asInstanceOf[Publication]
publication.getAuthor.asScala.foreach(author => {
Assertions.assertNotNull(author.getRawAffiliationString)
Assertions.assertTrue(author.getRawAffiliationString.size() > 0)
})
})
println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(data.head))
}
}

View File

@ -5,7 +5,10 @@ import eu.dnetlib.dhp.aggregation.AbstractVocabularyTest
import eu.dnetlib.dhp.schema.oaf.utils.PidType
import eu.dnetlib.dhp.schema.oaf.{Oaf, Publication, Relation, Result}
import eu.dnetlib.dhp.sx.bio.BioDBToOAF.ScholixResolved
import eu.dnetlib.dhp.sx.bio.pubmed.{PMArticle, PMParser, PMSubject, PubMedToOaf}
import eu.dnetlib.dhp.sx.bio.ebi.SparkCreatePubmedDump
import eu.dnetlib.dhp.sx.bio.pubmed._
import org.apache.commons.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.json4s.DefaultFormats
import org.json4s.JsonAST.{JField, JObject, JString}
import org.json4s.jackson.JsonMethods.parse
@ -13,14 +16,16 @@ import org.junit.jupiter.api.Assertions._
import org.junit.jupiter.api.extension.ExtendWith
import org.junit.jupiter.api.{BeforeEach, Test}
import org.mockito.junit.jupiter.MockitoExtension
import org.slf4j.LoggerFactory
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.util.regex.Pattern
import java.util.zip.GZIPInputStream
import javax.xml.stream.XMLInputFactory
import scala.collection.JavaConverters._
import scala.collection.mutable
import scala.collection.mutable.ListBuffer
import scala.io.Source
import scala.xml.pull.XMLEventReader
@ExtendWith(Array(classOf[MockitoExtension]))
class BioScholixTest extends AbstractVocabularyTest {
@ -48,6 +53,76 @@ class BioScholixTest extends AbstractVocabularyTest {
}
}
@Test
def testPid(): Unit = {
val pids = List(
"0000000163025705",
"000000018494732X",
"0000000308873343",
"0000000335964515",
"0000000333457333",
"0000000335964515",
"0000000302921949",
"http://orcid.org/0000-0001-8567-3543",
"http://orcid.org/0000-0001-7868-8528",
"0000-0001-9189-1440",
"0000-0003-3727-9247",
"0000-0001-7246-1058",
"000000033962389X",
"0000000330371470",
"0000000171236123",
"0000000272569752",
"0000000293231371",
"http://orcid.org/0000-0003-3345-7333",
"0000000340145688",
"http://orcid.org/0000-0003-4894-1689"
)
pids.foreach(pid => {
val pidCleaned = new PMIdentifier(pid, "ORCID").getPid
// assert pid is in the format of ORCID
println(pidCleaned)
assertTrue(pidCleaned.matches("[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]"))
})
}
def extractAffiliation(s: String): List[String] = {
val regex: String = "<Affiliation>(.*)<\\/Affiliation>"
val pattern = Pattern.compile(regex, Pattern.MULTILINE)
val matcher = pattern.matcher(s)
val l: mutable.ListBuffer[String] = mutable.ListBuffer()
while (matcher.find()) {
l += matcher.group(1)
}
l.toList
}
case class AuthorPID(pidType: String, pid: String) {}
def extractAuthorIdentifier(s: String): List[AuthorPID] = {
val regex: String = "<Identifier Source=\"(.*)\">(.*)<\\/Identifier>"
val pattern = Pattern.compile(regex, Pattern.MULTILINE)
val matcher = pattern.matcher(s)
val l: mutable.ListBuffer[AuthorPID] = mutable.ListBuffer()
while (matcher.find()) {
l += AuthorPID(pidType = matcher.group(1), pid = matcher.group(2))
}
l.toList
}
@Test
def testParsingPubmed2(): Unit = {
val mapper = new ObjectMapper()
val xml = IOUtils.toString(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/single_pubmed.xml"))
val parser = new PMParser2()
val article = parser.parse(xml)
// println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(article))
println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(PubMedToOaf.convert(article, vocabularies)))
}
@Test
def testEBIData() = {
val inputFactory = XMLInputFactory.newInstance
@ -124,6 +199,14 @@ class BioScholixTest extends AbstractVocabularyTest {
}
}
def testPubmedSplitting(): Unit = {
val spark: SparkSession = SparkSession.builder().appName("test").master("local").getOrCreate()
new SparkCreatePubmedDump("", Array.empty, LoggerFactory.getLogger(getClass))
.createPubmedDump(spark, "/home/sandro/Downloads/pubmed", "/home/sandro/Downloads/pubmed_mapped", vocabularies)
}
@Test
def testPubmedOriginalID(): Unit = {
val article: PMArticle = new PMArticle

View File

@ -135,7 +135,7 @@ public class DedupRecordFactory {
return Collections.emptyIterator();
}
OafEntity mergedEntity = MergeUtils.mergeGroup(dedupId, cliques.iterator());
OafEntity mergedEntity = MergeUtils.mergeGroup(cliques.iterator());
// dedup records do not have date of transformation attribute
mergedEntity.setDateoftransformation(null);
mergedEntity

View File

@ -69,6 +69,7 @@ public class SparkPropagateRelation extends AbstractSparkAction {
Dataset<Relation> mergeRels = spark
.read()
.schema(REL_BEAN_ENC.schema())
.load(DedupUtility.createMergeRelPath(workingPath, "*", "*"))
.as(REL_BEAN_ENC);

View File

@ -46,8 +46,8 @@ class DatasetMergerTest implements Serializable {
}
@Test
void datasetMergerTest() throws InstantiationException, IllegalAccessException, InvocationTargetException {
Dataset pub_merged = MergeUtils.mergeGroup(dedupId, datasets.stream().map(Tuple2::_2).iterator());
void datasetMergerTest() {
Dataset pub_merged = MergeUtils.mergeGroup(datasets.stream().map(Tuple2::_2).iterator());
// verify id
assertEquals(dedupId, pub_merged.getId());

View File

@ -21,17 +21,15 @@ class DecisionTreeTest {
void testJPath() throws IOException {
DedupConfig conf = DedupConfig
.load(IOUtils.toString(getClass().getResourceAsStream("dedup_conf_organization.json")));
.load(IOUtils.toString(getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/jpath/dedup_conf_organization.json")));
final String org = IOUtils.toString(getClass().getResourceAsStream("organization.json"));
final String org = IOUtils.toString(getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/jpath/organization.json"));
Row row = SparkModel.apply(conf).rowFromJson(org);
System.out.println("row = " + row);
Assertions.assertNotNull(row);
Assertions.assertTrue(StringUtils.isNotBlank(row.getAs("identifier")));
System.out.println("row = " + row.getAs("countrytitle"));
}
@Test
@ -44,7 +42,7 @@ class DecisionTreeTest {
.getResourceAsStream(
"/eu/dnetlib/dhp/dedup/conf/org.curr.conf.json")));
final String org = IOUtils.toString(getClass().getResourceAsStream("organization_example1.json"));
final String org = IOUtils.toString(getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/jpath/organization_example1.json"));
Row row = SparkModel.apply(conf).rowFromJson(org);
// to check that the same parsing returns the same row

View File

@ -190,7 +190,7 @@ public class SparkDedupTest implements Serializable {
System.out.println("orp_simrel = " + orp_simrel);
if (CHECK_CARDINALITIES) {
assertEquals(742, orgs_simrel);
assertEquals(720, orgs_simrel);
assertEquals(566, pubs_simrel);
assertEquals(113, sw_simrel);
assertEquals(148, ds_simrel);
@ -251,7 +251,7 @@ public class SparkDedupTest implements Serializable {
// entities simrels supposed to be equal to the number of previous step (no rels in whitelist)
if (CHECK_CARDINALITIES) {
assertEquals(742, orgs_simrel);
assertEquals(720, orgs_simrel);
assertEquals(566, pubs_simrel);
assertEquals(148, ds_simrel);
assertEquals(280, orp_simrel);
@ -440,25 +440,26 @@ public class SparkDedupTest implements Serializable {
.count();
final List<Relation> merges = pubs
.filter("source == '50|arXiv_dedup_::c93aeb433eb90ed7a86e29be00791b7c'")
.filter("source == '50|doi_dedup___::d5021b53204e4fdeab6ff5d5bc468032'")// and relClass = '"+ModelConstants.MERGES+"'")
.collectAsList();
assertEquals(1, merges.size());
assertEquals(4, merges.size());
Set<String> dups = Sets
.newHashSet(
"50|doi_________::3b1d0d8e8f930826665df9d6b82fbb73",
"50|doi_________::d5021b53204e4fdeab6ff5d5bc468032",
"50|arXiv_______::c93aeb433eb90ed7a86e29be00791b7c");
"50|arXiv_______::c93aeb433eb90ed7a86e29be00791b7c",
"50|arXiv_dedup_::c93aeb433eb90ed7a86e29be00791b7c");
merges.forEach(r -> {
assertEquals(ModelConstants.RESULT_RESULT, r.getRelType());
assertEquals(ModelConstants.DEDUP, r.getSubRelType());
assertEquals(ModelConstants.IS_MERGED_IN, r.getRelClass());
assertEquals(ModelConstants.MERGES, r.getRelClass());
assertTrue(dups.contains(r.getTarget()));
});
final List<Relation> mergedIn = pubs
.filter("target == '50|arXiv_dedup_::c93aeb433eb90ed7a86e29be00791b7c'")
.filter("target == '50|doi_dedup___::d5021b53204e4fdeab6ff5d5bc468032'")
.collectAsList();
assertEquals(3, mergedIn.size());
assertEquals(4, mergedIn.size());
mergedIn.forEach(r -> {
assertEquals(ModelConstants.RESULT_RESULT, r.getRelType());
assertEquals(ModelConstants.DEDUP, r.getSubRelType());
@ -473,8 +474,8 @@ public class SparkDedupTest implements Serializable {
System.out.println("orp_mergerel = " + orp_mergerel);
if (CHECK_CARDINALITIES) {
assertEquals(1268, orgs_mergerel);
assertEquals(1156, pubs.count());
assertEquals(1280, orgs_mergerel);
assertEquals(1158, pubs.count());
assertEquals(292, sw_mergerel);
assertEquals(476, ds_mergerel);
assertEquals(742, orp_mergerel);
@ -561,7 +562,7 @@ public class SparkDedupTest implements Serializable {
System.out.println("orp_mergerel = " + orp_mergerel);
if (CHECK_CARDINALITIES) {
assertEquals(1278, orgs_mergerel);
assertEquals(1280, orgs_mergerel);
assertEquals(1156, pubs.count());
assertEquals(292, sw_mergerel);
assertEquals(476, ds_mergerel);
@ -618,7 +619,7 @@ public class SparkDedupTest implements Serializable {
System.out.println("orp_deduprecord = " + orp_deduprecord);
if (CHECK_CARDINALITIES) {
assertEquals(78, orgs_deduprecord);
assertEquals(87, orgs_deduprecord);
assertEquals(96, pubs.count());
assertEquals(47, sw_deduprecord);
assertEquals(97, ds_deduprecord);
@ -761,7 +762,7 @@ public class SparkDedupTest implements Serializable {
if (CHECK_CARDINALITIES) {
assertEquals(930, publications);
assertEquals(831, organizations);
assertEquals(840, organizations);
assertEquals(100, projects);
assertEquals(100, datasource);
assertEquals(196, softwares);

View File

@ -146,7 +146,7 @@ public class SparkOpenorgsDedupTest implements Serializable {
.load(DedupUtility.createSimRelPath(testOutputBasePath, testActionSetId, "organization"))
.count();
assertEquals(92, orgs_simrel);
assertEquals(91, orgs_simrel);
}
@Test
@ -175,7 +175,7 @@ public class SparkOpenorgsDedupTest implements Serializable {
.load(DedupUtility.createSimRelPath(testOutputBasePath, testActionSetId, "organization"))
.count();
assertEquals(128, orgs_simrel);
assertEquals(127, orgs_simrel);
}
@Test

View File

@ -324,7 +324,7 @@ public class SparkPublicationRootsTest implements Serializable {
private void verifyRoot_case_3(Dataset<Publication> roots, Dataset<Publication> pubs) {
Publication root = roots
.filter("id = '50|dedup_wf_001::31ca734cc22181b704c4aa8fd050062a'")
.filter("id = '50|dedup_wf_002::7143f4ff5708f3657db0b7e68ea74d55'")
.first();
assertNotNull(root);

View File

@ -143,7 +143,9 @@ public class SparkPublicationRootsTest2 implements Serializable {
"--graphBasePath", graphInputPath,
"--actionSetId", testActionSetId,
"--isLookUpUrl", "lookupurl",
"--workingPath", workingPath
"--workingPath", workingPath,
"--hiveMetastoreUris", "",
}), spark)
.run(isLookUpService);
@ -153,7 +155,7 @@ public class SparkPublicationRootsTest2 implements Serializable {
.as(Encoders.bean(Relation.class));
assertEquals(
3, merges
4, merges
.filter("relclass == 'isMergedIn'")
.map((MapFunction<Relation, String>) Relation::getTarget, Encoders.STRING())
.distinct()
@ -178,7 +180,7 @@ public class SparkPublicationRootsTest2 implements Serializable {
.textFile(workingPath + "/" + testActionSetId + "/publication_deduprecord")
.map(asEntity(Publication.class), Encoders.bean(Publication.class));
assertEquals(3, roots.count());
assertEquals(4, roots.count());
final Dataset<Publication> pubs = spark
.read()
@ -195,7 +197,6 @@ public class SparkPublicationRootsTest2 implements Serializable {
.collectAsList()
.get(0);
assertEquals(crossref_duplicate.getDateofacceptance().getValue(), root.getDateofacceptance().getValue());
assertEquals(crossref_duplicate.getJournal().getName(), root.getJournal().getName());
assertEquals(crossref_duplicate.getJournal().getIssnPrinted(), root.getJournal().getIssnPrinted());
assertEquals(crossref_duplicate.getPublisher().getValue(), root.getPublisher().getValue());

View File

@ -168,7 +168,7 @@ public class SparkStatsTest implements Serializable {
.load(testOutputBasePath + "/" + testActionSetId + "/otherresearchproduct_blockstats")
.count();
assertEquals(414, orgs_blocks);
assertEquals(406, orgs_blocks);
assertEquals(221, pubs_blocks);
assertEquals(134, sw_blocks);
assertEquals(196, ds_blocks);

View File

@ -19,17 +19,15 @@ class JsonPathTest {
void testJPath() throws IOException {
DedupConfig conf = DedupConfig
.load(IOUtils.toString(getClass().getResourceAsStream("dedup_conf_organization.json")));
.load(IOUtils.toString(getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/jpath/dedup_conf_organization.json")));
final String org = IOUtils.toString(getClass().getResourceAsStream("organization.json"));
final String org = IOUtils.toString(getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/jpath/organization.json"));
Row row = SparkModel.apply(conf).rowFromJson(org);
System.out.println("row = " + row);
Assertions.assertNotNull(row);
Assertions.assertTrue(StringUtils.isNotBlank(row.getAs("identifier")));
System.out.println("row = " + row.getAs("countrytitle"));
}
@Test

View File

@ -24,22 +24,19 @@
"start": {
"fields": [
{
"field": "gridid",
"comparator": "exactMatch",
"field": "pid",
"comparator": "jsonListMatch",
"weight": 1,
"countIfUndefined": "false",
"params": {}
},
{
"field": "rorid",
"comparator": "exactMatch",
"weight": 1,
"countIfUndefined": "false",
"params": {}
"params": {
"jpath_classid": "$.qualifier.classid",
"jpath_value": "$.value",
"mode": "type"
}
}
],
"threshold": 1,
"aggregation": "OR",
"aggregation": "MAX",
"positive": "MATCH",
"negative": "NO_MATCH",
"undefined": "necessaryConditions",
@ -149,11 +146,10 @@
"model" : [
{ "name" : "country", "type" : "String", "path" : "$.country.classid", "infer" : "country", "inferenceFrom" : "$.legalname.value"},
{ "name" : "legalshortname", "type" : "String", "path" : "$.legalshortname.value", "infer" : "city_keyword"},
{ "name" : "original_legalname", "type" : "String", "path" : "$.legalname.value" },
{ "name" : "original_legalname", "type" : "String", "path" : "$.legalname.value", "clean": "title"},
{ "name" : "legalname", "type" : "String", "path" : "$.legalname.value", "infer" : "city_keyword"},
{ "name" : "websiteurl", "type" : "URL", "path" : "$.websiteurl.value" },
{ "name" : "gridid", "type" : "String", "path" : "$.pid[?(@.qualifier.classid =='grid')].value"},
{ "name" : "rorid", "type" : "String", "path" : "$.pid[?(@.qualifier.classid =='ROR')].value"},
{ "name": "pid", "type": "JSON", "path": "$.pid[*]", "overrideMatch": "true"},
{ "name" : "originalId", "type" : "String", "path" : "$.id" }
],
"blacklists" : {},

View File

@ -566,7 +566,26 @@ case object Crossref2Oaf {
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
case _ => logger.debug("no match for " + funder.DOI.get)
//Add for Danish funders
//Independent Research Fund Denmark (IRFD)
case "10.13039/501100004836" =>
generateSimpleRelationFromAward(funder, "irfd________", a => a)
val targetId = getProjectId("irfd________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
//Carlsberg Foundation (CF)
case "10.13039/501100002808" =>
generateSimpleRelationFromAward(funder, "cf__________", a => a)
val targetId = getProjectId("cf__________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
//Novo Nordisk Foundation (NNF)
case "10.13039/501100009708" =>
generateSimpleRelationFromAward(funder, "nnf___________", a => a)
val targetId = getProjectId("nnf_________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
case _ => logger.debug("no match for " + funder.DOI.get)
}
} else {

View File

@ -8,6 +8,7 @@ import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import org.apache.commons.lang.StringUtils;
import org.jetbrains.annotations.NotNull;
/**
@ -37,15 +38,15 @@ public class QueryCommunityAPI {
}
public static String community(String id, String baseURL) throws IOException {
public static String subcommunities(String communityId, String baseURL) throws IOException {
return get(baseURL + id);
return get(baseURL + communityId + "/subcommunities");
}
public static String communityDatasource(String id, String baseURL) throws IOException {
return get(baseURL + id + "/contentproviders");
return get(baseURL + id + "/datasources");
}
@ -61,6 +62,10 @@ public class QueryCommunityAPI {
}
public static String propagationOrganizationCommunityMap(String baseURL) throws IOException {
return get(StringUtils.substringBefore(baseURL, "community") + "propagationOrganizationCommunityMap");
}
@NotNull
private static String getBody(HttpURLConnection conn) throws IOException {
String body = "{}";
@ -78,4 +83,24 @@ public class QueryCommunityAPI {
return body;
}
public static String subcommunityDatasource(String communityId, String subcommunityId, String baseURL)
throws IOException {
return get(baseURL + communityId + "/subcommunities/datasources?subCommunityId=" + subcommunityId);
}
public static String subcommunityPropagationOrganization(String communityId, String subcommunityId, String baseURL)
throws IOException {
return get(baseURL + communityId + "/subcommunities/propagationOrganizations?subCommunityId=" + subcommunityId);
}
public static String subcommunityProjects(String communityId, String subcommunityId, String page, String size,
String baseURL) throws IOException {
return get(
baseURL + communityId + "/subcommunities/projects/" + page + "/" + size + "?subCommunityId="
+ subcommunityId);
}
public static String propagationDatasourceCommunityMap(String baseURL) throws IOException {
return get(baseURL + "/propagationDatasourceCommunityMap");
}
}

View File

@ -8,9 +8,8 @@ import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils;
import org.jetbrains.annotations.NotNull;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.Maps;
@ -33,73 +32,137 @@ public class Utils implements Serializable {
private static final ObjectMapper MAPPER = new ObjectMapper();
private static final VerbResolver resolver = VerbResolverFactory.newInstance();
private static final Logger log = LoggerFactory.getLogger(Utils.class);
@FunctionalInterface
private interface ProjectQueryFunction {
String query(int page, int size);
}
@FunctionalInterface
private interface DatasourceQueryFunction {
String query();
}
// PROJECT METHODS
public static CommunityEntityMap getProjectCommunityMap(String baseURL) throws IOException {
CommunityEntityMap projectMap = new CommunityEntityMap();
public static CommunityConfiguration getCommunityConfiguration(String baseURL) throws IOException {
final Map<String, Community> communities = Maps.newHashMap();
List<Community> validCommunities = new ArrayList<>();
getValidCommunities(baseURL)
.forEach(community -> {
addRelevantProjects(community.getId(), baseURL, projectMap);
try {
CommunityModel cm = MAPPER
.readValue(QueryCommunityAPI.community(community.getId(), baseURL), CommunityModel.class);
validCommunities.add(getCommunity(cm));
List<SubCommunityModel> subcommunities = getSubcommunities(community.getId(), baseURL);
subcommunities
.forEach(
sc -> addRelevantProjects(community.getId(), sc.getSubCommunityId(), baseURL, projectMap));
} catch (IOException e) {
throw new RuntimeException(e);
}
});
validCommunities.forEach(community -> {
return projectMap;
}
private static void addRelevantProjects(
String communityId,
String baseURL,
CommunityEntityMap communityEntityMap) {
fetchAndProcessProjects(
(page, size) -> {
try {
return QueryCommunityAPI
.communityProjects(communityId, String.valueOf(page), String.valueOf(size), baseURL);
} catch (IOException e) {
throw new RuntimeException(e);
}
},
communityId,
communityEntityMap);
}
private static void addRelevantProjects(
String communityId,
String subcommunityId,
String baseURL,
CommunityEntityMap communityEntityMap) {
fetchAndProcessProjects(
(page, size) -> {
try {
return QueryCommunityAPI
.subcommunityProjects(
communityId, subcommunityId, String.valueOf(page), String.valueOf(size), baseURL);
} catch (IOException e) {
throw new RuntimeException(e);
}
},
communityId,
communityEntityMap);
}
private static void fetchAndProcessProjects(
ProjectQueryFunction projectQueryFunction,
String communityId,
CommunityEntityMap communityEntityMap) {
int page = 0;
final int size = 100;
ContentModel contentModel;
do {
try {
DatasourceList dl = MAPPER
.readValue(
QueryCommunityAPI.communityDatasource(community.getId(), baseURL), DatasourceList.class);
community.setProviders(dl.stream().map(d -> {
if (d.getEnabled() == null || Boolean.FALSE.equals(d.getEnabled()))
return null;
Provider p = new Provider();
p.setOpenaireId(ModelSupport.getIdPrefix(Datasource.class) + "|" + d.getOpenaireId());
p.setSelectionConstraints(d.getSelectioncriteria());
if (p.getSelectionConstraints() != null)
p.getSelectionConstraints().setSelection(resolver);
return p;
})
.filter(Objects::nonNull)
.collect(Collectors.toList()));
String response = projectQueryFunction.query(page, size);
contentModel = MAPPER.readValue(response, ContentModel.class);
if (!contentModel.getContent().isEmpty()) {
contentModel
.getContent()
.forEach(
project -> communityEntityMap
.add(
ModelSupport.getIdPrefix(Project.class) + "|" + project.getOpenaireId(),
communityId));
}
} catch (IOException e) {
throw new RuntimeException(e);
throw new RuntimeException("Error processing projects for community: " + communityId, e);
}
});
validCommunities.forEach(community -> {
if (community.isValid())
communities.put(community.getId(), community);
});
return new CommunityConfiguration(communities);
page++;
} while (!contentModel.getLast());
}
private static Community getCommunity(CommunityModel cm) {
Community c = new Community();
c.setId(cm.getId());
c.setZenodoCommunities(cm.getOtherZenodoCommunities());
if (StringUtils.isNotBlank(cm.getZenodoCommunity()))
c.getZenodoCommunities().add(cm.getZenodoCommunity());
c.setSubjects(cm.getSubjects());
c.getSubjects().addAll(cm.getFos());
c.getSubjects().addAll(cm.getSdg());
if (cm.getAdvancedConstraints() != null) {
c.setConstraints(cm.getAdvancedConstraints());
c.getConstraints().setSelection(resolver);
private static List<Provider> getCommunityContentProviders(
DatasourceQueryFunction datasourceQueryFunction) {
try {
String response = datasourceQueryFunction.query();
List<CommunityContentprovider> datasourceList = MAPPER
.readValue(response, new TypeReference<List<CommunityContentprovider>>() {
});
return datasourceList.stream().map(d -> {
if (d.getEnabled() == null || Boolean.FALSE.equals(d.getEnabled()))
return null;
Provider p = new Provider();
p.setOpenaireId(ModelSupport.getIdPrefix(Datasource.class) + "|" + d.getOpenaireId());
p.setSelectionConstraints(d.getSelectioncriteria());
if (p.getSelectionConstraints() != null)
p.getSelectionConstraints().setSelection(resolver);
return p;
})
.filter(Objects::nonNull)
.collect(Collectors.toList());
} catch (IOException e) {
throw new RuntimeException("Error processing datasource information: " + e);
}
if (cm.getRemoveConstraints() != null) {
c.setRemoveConstraints(cm.getRemoveConstraints());
c.getRemoveConstraints().setSelection(resolver);
}
return c;
}
/**
* Select the communties with status different from hidden
* @param baseURL the base url of the API to be queried
* @return the list of communities in the CommunityModel class
* @throws IOException
*/
public static List<CommunityModel> getValidCommunities(String baseURL) throws IOException {
return MAPPER
.readValue(QueryCommunityAPI.communities(baseURL), CommunitySummary.class)
List<CommunityModel> listCommunity = MAPPER
.readValue(QueryCommunityAPI.communities(baseURL), new TypeReference<List<CommunityModel>>() {
});
return listCommunity
.stream()
.filter(
community -> !community.getStatus().equals("hidden") &&
@ -107,108 +170,217 @@ public class Utils implements Serializable {
.collect(Collectors.toList());
}
/**
* Sets the Community information from the replies of the communityAPIs
* @param baseURL the base url of the API to be queried
* @param communityModel the communityModel as replied by the APIs
* @return the community set with information from the community model and for the content providers
*/
private static Community getCommunity(String baseURL, CommunityModel communityModel) {
Community community = getCommunity(communityModel);
community.setProviders(getCommunityContentProviders(() -> {
try {
return QueryCommunityAPI.communityDatasource(community.getId(), baseURL);
} catch (IOException e) {
throw new RuntimeException(e);
}
}));
return community;
}
/**
* extends the community configuration for the subcommunity by adding the content providers
* @param baseURL
* @param communityId
* @param sc
* @return
*/
private static @NotNull Community getSubCommunityConfiguration(String baseURL, String communityId,
SubCommunityModel sc) {
Community c = getCommunity(sc);
c.setProviders(getCommunityContentProviders(() -> {
try {
return QueryCommunityAPI.subcommunityDatasource(communityId, sc.getSubCommunityId(), baseURL);
} catch (IOException e) {
throw new RuntimeException(e);
}
}));
return c;
}
/**
* Gets all the sub-comminities fir a given community identifier
* @param communityId
* @param baseURL
* @return
*/
private static List<Community> getSubCommunity(String communityId, String baseURL) {
try {
List<SubCommunityModel> subcommunities = getSubcommunities(communityId, baseURL);
return subcommunities
.stream()
.map(sc -> getSubCommunityConfiguration(baseURL, communityId, sc))
.collect(Collectors.toList());
} catch (IOException e) {
throw new RuntimeException(e);
}
}
/**
* prepare the configuration for the communities and sub-communities
* @param baseURL
* @return
* @throws IOException
*/
public static CommunityConfiguration getCommunityConfiguration(String baseURL) throws IOException {
final Map<String, Community> communities = Maps.newHashMap();
List<CommunityModel> communityList = getValidCommunities(baseURL);
List<Community> validCommunities = new ArrayList<>();
communityList.forEach(community -> {
validCommunities.add(getCommunity(baseURL, community));
validCommunities.addAll(getSubCommunity(community.getId(), baseURL));
});
validCommunities.forEach(community -> {
if (community.isValid())
communities.put(community.getId(), community);
});
return new CommunityConfiguration(communities);
}
/**
* filles the common fields in the community model for both the communityconfiguration and the subcommunityconfiguration
* @param input
* @return
* @param <C>
*/
private static <C extends CommonConfigurationModel> Community getCommonConfiguration(C input) {
Community c = new Community();
c.setZenodoCommunities(input.getOtherZenodoCommunities());
if (StringUtils.isNotBlank(input.getZenodoCommunity()))
c.getZenodoCommunities().add(input.getZenodoCommunity());
c.setSubjects(input.getSubjects());
if (input.getFos() != null)
c.getSubjects().addAll(input.getFos());
if (input.getSdg() != null)
c.getSubjects().addAll(input.getSdg());
if (input.getAdvancedConstraints() != null) {
c.setConstraints(input.getAdvancedConstraints());
c.getConstraints().setSelection(resolver);
}
if (input.getRemoveConstraints() != null) {
c.setRemoveConstraints(input.getRemoveConstraints());
c.getRemoveConstraints().setSelection(resolver);
}
return c;
}
private static Community getCommunity(SubCommunityModel sc) {
Community c = getCommonConfiguration(sc);
c.setId(sc.getSubCommunityId());
return c;
}
private static Community getCommunity(CommunityModel cm) {
Community c = getCommonConfiguration(cm);
c.setId(cm.getId());
return c;
}
public static List<SubCommunityModel> getSubcommunities(String communityId, String baseURL) throws IOException {
return MAPPER
.readValue(
QueryCommunityAPI.subcommunities(communityId, baseURL), new TypeReference<List<SubCommunityModel>>() {
});
}
public static CommunityEntityMap getOrganizationCommunityMap(String baseURL) throws IOException {
return MAPPER
.readValue(QueryCommunityAPI.propagationOrganizationCommunityMap(baseURL), CommunityEntityMap.class);
}
public static CommunityEntityMap getDatasourceCommunityMap(String baseURL) throws IOException {
return MAPPER.readValue(QueryCommunityAPI.propagationDatasourceCommunityMap(baseURL), CommunityEntityMap.class);
}
private static void getRelatedOrganizations(String communityId, String baseURL,
CommunityEntityMap communityEntityMap) {
try {
List<String> associatedOrgs = MAPPER
.readValue(
QueryCommunityAPI.communityPropagationOrganization(communityId, baseURL),
EntityIdentifierList.class);
associatedOrgs
.forEach(
o -> communityEntityMap.add(ModelSupport.getIdPrefix(Organization.class) + "|" + o, communityId));
} catch (IOException e) {
throw new RuntimeException(e);
}
}
private static void getRelatedOrganizations(String communityId, String subcommunityId, String baseURL,
CommunityEntityMap communityEntityMap) {
try {
List<String> associatedOrgs = MAPPER
.readValue(
QueryCommunityAPI.subcommunityPropagationOrganization(communityId, subcommunityId, baseURL),
EntityIdentifierList.class);
associatedOrgs
.forEach(
o -> communityEntityMap.add(ModelSupport.getIdPrefix(Organization.class) + "|" + o, communityId));
} catch (IOException e) {
throw new RuntimeException(e);
}
}
/**
* it returns for each organization the list of associated communities
*/
public static CommunityEntityMap getCommunityOrganization(String baseURL) throws IOException {
CommunityEntityMap organizationMap = new CommunityEntityMap();
String entityPrefix = ModelSupport.getIdPrefix(Organization.class);
getValidCommunities(baseURL)
.forEach(community -> {
String id = community.getId();
try {
List<String> associatedOrgs = MAPPER
.readValue(
QueryCommunityAPI.communityPropagationOrganization(id, baseURL), OrganizationList.class);
associatedOrgs.forEach(o -> {
if (!organizationMap
.keySet()
.contains(
entityPrefix + "|" + o))
organizationMap.put(entityPrefix + "|" + o, new ArrayList<>());
organizationMap.get(entityPrefix + "|" + o).add(community.getId());
});
} catch (IOException e) {
throw new RuntimeException(e);
}
});
return organizationMap;
}
public static CommunityEntityMap getCommunityProjects(String baseURL) throws IOException {
CommunityEntityMap projectMap = new CommunityEntityMap();
String entityPrefix = ModelSupport.getIdPrefix(Project.class);
getValidCommunities(baseURL)
.forEach(community -> {
int page = -1;
int size = 100;
ContentModel cm = new ContentModel();
do {
page++;
try {
cm = MAPPER
.readValue(
QueryCommunityAPI
.communityProjects(
community.getId(), String.valueOf(page), String.valueOf(size), baseURL),
ContentModel.class);
if (cm.getContent().size() > 0) {
cm.getContent().forEach(p -> {
if (!projectMap.keySet().contains(entityPrefix + "|" + p.getOpenaireId()))
projectMap.put(entityPrefix + "|" + p.getOpenaireId(), new ArrayList<>());
projectMap.get(entityPrefix + "|" + p.getOpenaireId()).add(community.getId());
});
}
} catch (IOException e) {
throw new RuntimeException(e);
}
} while (!cm.getLast());
});
return projectMap;
}
public static List<String> getCommunityIdList(String baseURL) throws IOException {
return getValidCommunities(baseURL)
.stream()
.map(community -> community.getId())
.collect(Collectors.toList());
}
public static List<EntityCommunities> getDatasourceCommunities(String baseURL) throws IOException {
List<CommunityModel> validCommunities = getValidCommunities(baseURL);
HashMap<String, Set<String>> map = new HashMap<>();
String entityPrefix = ModelSupport.getIdPrefix(Datasource.class) + "|";
validCommunities.forEach(c -> {
List<CommunityModel> communityList = getValidCommunities(baseURL);
communityList.forEach(community -> {
getRelatedOrganizations(community.getId(), baseURL, organizationMap);
try {
new ObjectMapper()
.readValue(QueryCommunityAPI.communityDatasource(c.getId(), baseURL), DatasourceList.class)
.forEach(d -> {
if (!map.keySet().contains(d.getOpenaireId()))
map.put(d.getOpenaireId(), new HashSet<>());
map.get(d.getOpenaireId()).add(c.getId());
});
List<SubCommunityModel> subcommunities = getSubcommunities(community.getId(), baseURL);
subcommunities
.forEach(
sc -> getRelatedOrganizations(
community.getId(), sc.getSubCommunityId(), baseURL, organizationMap));
} catch (IOException e) {
throw new RuntimeException(e);
}
});
List<EntityCommunities> temp = map
.keySet()
.stream()
.map(k -> EntityCommunities.newInstance(entityPrefix + k, getCollect(k, map)))
.collect(Collectors.toList());
return temp;
return organizationMap;
}
@NotNull
private static List<String> getCollect(String k, HashMap<String, Set<String>> map) {
List<String> temp = map.get(k).stream().collect(Collectors.toList());
return temp;
public static List<String> getCommunityIdList(String baseURL) throws IOException {
return getValidCommunities(baseURL)
.stream()
.flatMap(communityModel -> {
List<String> communityIds = new ArrayList<>();
communityIds.add(communityModel.getId());
try {
Utils
.getSubcommunities(communityModel.getId(), baseURL)
.forEach(sc -> communityIds.add(sc.getSubCommunityId()));
} catch (IOException e) {
throw new RuntimeException(e);
}
return communityIds.stream();
})
.collect(Collectors.toList());
}
}

View File

@ -0,0 +1,76 @@
package eu.dnetlib.dhp.api.model;
import java.io.Serializable;
import java.util.List;
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import eu.dnetlib.dhp.bulktag.community.SelectionConstraints;
@JsonIgnoreProperties(ignoreUnknown = true)
public class CommonConfigurationModel implements Serializable {
private String zenodoCommunity;
private List<String> subjects;
private List<String> otherZenodoCommunities;
private List<String> fos;
private List<String> sdg;
private SelectionConstraints advancedConstraints;
private SelectionConstraints removeConstraints;
public String getZenodoCommunity() {
return zenodoCommunity;
}
public void setZenodoCommunity(String zenodoCommunity) {
this.zenodoCommunity = zenodoCommunity;
}
public List<String> getSubjects() {
return subjects;
}
public void setSubjects(List<String> subjects) {
this.subjects = subjects;
}
public List<String> getOtherZenodoCommunities() {
return otherZenodoCommunities;
}
public void setOtherZenodoCommunities(List<String> otherZenodoCommunities) {
this.otherZenodoCommunities = otherZenodoCommunities;
}
public List<String> getFos() {
return fos;
}
public void setFos(List<String> fos) {
this.fos = fos;
}
public List<String> getSdg() {
return sdg;
}
public void setSdg(List<String> sdg) {
this.sdg = sdg;
}
public SelectionConstraints getRemoveConstraints() {
return removeConstraints;
}
public void setRemoveConstraints(SelectionConstraints removeConstraints) {
this.removeConstraints = removeConstraints;
}
public SelectionConstraints getAdvancedConstraints() {
return advancedConstraints;
}
public void setAdvancedConstraints(SelectionConstraints advancedConstraints) {
this.advancedConstraints = advancedConstraints;
}
}

View File

@ -18,4 +18,12 @@ public class CommunityEntityMap extends HashMap<String, List<String>> {
}
return super.get(key);
}
public void add(String key, String value) {
if (!super.containsKey(key)) {
super.put(key, new ArrayList<>());
}
super.get(key).add(value);
}
}

View File

@ -13,75 +13,11 @@ import eu.dnetlib.dhp.bulktag.community.SelectionConstraints;
* @Date 06/10/23
*/
@JsonIgnoreProperties(ignoreUnknown = true)
public class CommunityModel implements Serializable {
public class CommunityModel extends CommonConfigurationModel implements Serializable {
private String id;
private String type;
private String status;
private String zenodoCommunity;
private List<String> subjects;
private List<String> otherZenodoCommunities;
private List<String> fos;
private List<String> sdg;
private SelectionConstraints advancedConstraints;
private SelectionConstraints removeConstraints;
public String getZenodoCommunity() {
return zenodoCommunity;
}
public void setZenodoCommunity(String zenodoCommunity) {
this.zenodoCommunity = zenodoCommunity;
}
public List<String> getSubjects() {
return subjects;
}
public void setSubjects(List<String> subjects) {
this.subjects = subjects;
}
public List<String> getOtherZenodoCommunities() {
return otherZenodoCommunities;
}
public void setOtherZenodoCommunities(List<String> otherZenodoCommunities) {
this.otherZenodoCommunities = otherZenodoCommunities;
}
public List<String> getFos() {
return fos;
}
public void setFos(List<String> fos) {
this.fos = fos;
}
public List<String> getSdg() {
return sdg;
}
public void setSdg(List<String> sdg) {
this.sdg = sdg;
}
public SelectionConstraints getRemoveConstraints() {
return removeConstraints;
}
public void setRemoveConstraints(SelectionConstraints removeConstraints) {
this.removeConstraints = removeConstraints;
}
public SelectionConstraints getAdvancedConstraints() {
return advancedConstraints;
}
public void setAdvancedConstraints(SelectionConstraints advancedConstraints) {
this.advancedConstraints = advancedConstraints;
}
public String getId() {
return id;
}

View File

@ -1,15 +0,0 @@
package eu.dnetlib.dhp.api.model;
import java.io.Serializable;
import java.util.ArrayList;
/**
* @author miriam.baglioni
* @Date 06/10/23
*/
public class CommunitySummary extends ArrayList<CommunityModel> implements Serializable {
public CommunitySummary() {
super();
}
}

View File

@ -1,13 +0,0 @@
package eu.dnetlib.dhp.api.model;
import java.io.Serializable;
import java.util.ArrayList;
import eu.dnetlib.dhp.api.model.CommunityContentprovider;
public class DatasourceList extends ArrayList<CommunityContentprovider> implements Serializable {
public DatasourceList() {
super();
}
}

View File

@ -8,9 +8,9 @@ import java.util.ArrayList;
* @author miriam.baglioni
* @Date 09/10/23
*/
public class OrganizationList extends ArrayList<String> implements Serializable {
public class EntityIdentifierList extends ArrayList<String> implements Serializable {
public OrganizationList() {
public EntityIdentifierList() {
super();
}
}

View File

@ -0,0 +1,19 @@
package eu.dnetlib.dhp.api.model;
import java.io.Serializable;
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
@JsonIgnoreProperties(ignoreUnknown = true)
public class SubCommunityModel extends CommonConfigurationModel implements Serializable {
private String subCommunityId;
public String getSubCommunityId() {
return subCommunityId;
}
public void setSubCommunityId(String subCommunityId) {
this.subCommunityId = subCommunityId;
}
}

View File

@ -15,10 +15,8 @@ import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -31,6 +29,8 @@ import eu.dnetlib.dhp.api.model.CommunityEntityMap;
import eu.dnetlib.dhp.api.model.EntityCommunities;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.bulktag.community.*;
import eu.dnetlib.dhp.common.action.ReadDatasourceMasterDuplicateFromDB;
import eu.dnetlib.dhp.common.action.model.MasterDuplicate;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*;
@ -88,6 +88,14 @@ public class SparkBulkTagJob {
log.info("protoMap: {}", temp);
ProtoMap protoMap = new Gson().fromJson(temp, ProtoMap.class);
log.info("pathMap: {}", new Gson().toJson(protoMap));
final String dbUrl = parser.get("dbUrl");
log.info("dbUrl: {}", dbUrl);
final String dbUser = parser.get("dbUser");
log.info("dbUser: {}", dbUser);
final String dbPassword = parser.get("dbPassword");
log.info("dbPassword: {}", dbPassword);
final String hdfsPath = outputPath + "masterDuplicate";
log.info("hdfsPath: {}", hdfsPath);
SparkConf conf = new SparkConf();
CommunityConfiguration cc;
@ -101,7 +109,7 @@ public class SparkBulkTagJob {
cc = CommunityConfigurationFactory.newInstance(taggingConf);
} else {
cc = Utils.getCommunityConfiguration(baseURL);
log.info(OBJECT_MAPPER.writeValueAsString(cc));
}
runWithSparkSession(
@ -109,20 +117,98 @@ public class SparkBulkTagJob {
isSparkSessionManaged,
spark -> {
extendCommunityConfigurationForEOSC(spark, inputPath, cc);
ReadDatasourceMasterDuplicateFromDB.execute(dbUrl, dbUser, dbPassword, hdfsPath, hdfsNameNode);
execBulkTag(
spark, inputPath, outputPath, protoMap, cc);
execEntityTag(
spark, inputPath + "organization", outputPath + "organization",
Utils.getCommunityOrganization(baseURL), Organization.class, TaggingConstants.CLASS_ID_ORGANIZATION,
mapWithRepresentativeOrganization(
spark, inputPath + "relation", Utils.getOrganizationCommunityMap(baseURL)),
Organization.class, TaggingConstants.CLASS_ID_ORGANIZATION,
TaggingConstants.CLASS_NAME_BULKTAG_ORGANIZATION);
execEntityTag(
spark, inputPath + "project", outputPath + "project", Utils.getCommunityProjects(baseURL),
spark, inputPath + "project", outputPath + "project",
Utils.getProjectCommunityMap(baseURL),
Project.class, TaggingConstants.CLASS_ID_PROJECT, TaggingConstants.CLASS_NAME_BULKTAG_PROJECT);
execDatasourceTag(spark, inputPath, outputPath, Utils.getDatasourceCommunities(baseURL));
execEntityTag(
spark, inputPath + "datasource", outputPath + "datasource",
mapWithMasterDatasource(spark, hdfsPath, Utils.getDatasourceCommunityMap(baseURL)),
Datasource.class, TaggingConstants.CLASS_ID_DATASOURCE,
TaggingConstants.CLASS_NAME_BULKTAG_DATASOURCE);
});
}
private static CommunityEntityMap mapWithMasterDatasource(SparkSession spark, String masterDuplicatePath,
CommunityEntityMap datasourceCommunityMap) {
// load master-duplicate relations
Dataset<MasterDuplicate> masterDuplicate = spark
.read()
.schema(Encoders.bean(MasterDuplicate.class).schema())
.json(masterDuplicatePath)
.as(Encoders.bean(MasterDuplicate.class));
// list of id for the communities related entities
List<String> idList = entityIdList(ModelSupport.idPrefixMap.get(Datasource.class), datasourceCommunityMap);
// find the mapping with the representative entity if any
Dataset<String> datasourceIdentifiers = spark.createDataset(idList, Encoders.STRING());
List<Row> mappedKeys = datasourceIdentifiers
.join(
masterDuplicate, datasourceIdentifiers.col("_1").equalTo(masterDuplicate.col("duplicateId")),
"left_semi")
.selectExpr("masterId as source", "duplicateId as target")
.collectAsList();
// remap the entity with its corresponding representative
return remapCommunityEntityMap(datasourceCommunityMap, mappedKeys);
}
private static List<String> entityIdList(String idPrefixMap, CommunityEntityMap datasourceCommunityMap) {
final String prefix = idPrefixMap + "|";
return datasourceCommunityMap
.keySet()
.stream()
.map(key -> prefix + key)
.collect(Collectors.toList());
}
private static CommunityEntityMap mapWithRepresentativeOrganization(SparkSession spark, String relationPath,
CommunityEntityMap organizationCommunityMap) {
Dataset<Row> mergesRel = spark
.read()
.schema(Encoders.bean(Relation.class).schema())
.json(relationPath)
.filter("datainfo.deletedbyinference != true and relClass = 'merges")
.select("source", "target");
List<String> idList = entityIdList(ModelSupport.idPrefixMap.get(Organization.class), organizationCommunityMap);
Dataset<String> organizationIdentifiers = spark.createDataset(idList, Encoders.STRING());
List<Row> mappedKeys = organizationIdentifiers
.join(mergesRel, organizationIdentifiers.col("_1").equalTo(mergesRel.col("target")), "left_semi")
.select("source", "target")
.collectAsList();
return remapCommunityEntityMap(organizationCommunityMap, mappedKeys);
}
private static CommunityEntityMap remapCommunityEntityMap(CommunityEntityMap entityCommunityMap,
List<Row> mappedKeys) {
for (Row mappedEntry : mappedKeys) {
String oldKey = mappedEntry.getAs("target");
String newKey = mappedEntry.getAs("source");
// inserts the newKey in the map while removing the oldKey. The remove produces the value in the Map, which
// will be used as the newValue parameter of the BiFunction
entityCommunityMap.merge(newKey, entityCommunityMap.remove(oldKey), (existing, newValue) -> {
existing.addAll(newValue);
return existing;
});
}
return entityCommunityMap;
}
private static <E extends OafEntity> void execEntityTag(SparkSession spark, String inputPath, String outputPath,
CommunityEntityMap communityEntity, Class<E> entityClass,
String classID, String calssName) {
@ -184,63 +270,6 @@ public class SparkBulkTagJob {
.json(inputPath);
}
private static void execDatasourceTag(SparkSession spark, String inputPath, String outputPath,
List<EntityCommunities> datasourceCommunities) {
Dataset<Datasource> datasource = readPath(spark, inputPath + "datasource", Datasource.class);
Dataset<EntityCommunities> dc = spark
.createDataset(datasourceCommunities, Encoders.bean(EntityCommunities.class));
datasource
.joinWith(dc, datasource.col("id").equalTo(dc.col("entityId")), "left")
.map((MapFunction<Tuple2<Datasource, EntityCommunities>, Datasource>) t2 -> {
Datasource ds = t2._1();
if (t2._2() != null) {
List<String> context = Optional
.ofNullable(ds.getContext())
.map(v -> v.stream().map(c -> c.getId()).collect(Collectors.toList()))
.orElse(new ArrayList<>());
if (!Optional.ofNullable(ds.getContext()).isPresent())
ds.setContext(new ArrayList<>());
t2._2().getCommunitiesId().forEach(c -> {
if (!context.contains(c)) {
Context con = new Context();
con.setId(c);
con
.setDataInfo(
Arrays
.asList(
OafMapperUtils
.dataInfo(
false, TaggingConstants.BULKTAG_DATA_INFO_TYPE, true, false,
OafMapperUtils
.qualifier(
TaggingConstants.CLASS_ID_DATASOURCE,
TaggingConstants.CLASS_NAME_BULKTAG_DATASOURCE,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS),
"1")));
ds.getContext().add(con);
}
});
}
return ds;
}, Encoders.bean(Datasource.class))
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputPath + "datasource");
readPath(spark, outputPath + "datasource", Datasource.class)
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(inputPath + "datasource");
}
private static void extendCommunityConfigurationForEOSC(SparkSession spark, String inputPath,
CommunityConfiguration cc) {
@ -278,11 +307,6 @@ public class SparkBulkTagJob {
ProtoMap protoMappingParams,
CommunityConfiguration communityConfiguration) {
try {
System.out.println(new ObjectMapper().writeValueAsString(protoMappingParams));
} catch (JsonProcessingException e) {
throw new RuntimeException(e);
}
ModelSupport.entityTypes
.keySet()
.parallelStream()

View File

@ -43,7 +43,8 @@ public class Community implements Serializable {
}
public void setSubjects(List<String> subjects) {
this.subjects = subjects;
if (subjects != null)
this.subjects = subjects;
}
public List<Provider> getProviders() {
@ -59,7 +60,8 @@ public class Community implements Serializable {
}
public void setZenodoCommunities(List<String> zenodoCommunities) {
this.zenodoCommunities = zenodoCommunities;
if (zenodoCommunities != null)
this.zenodoCommunities = zenodoCommunities;
}
public SelectionConstraints getConstraints() {

View File

@ -130,6 +130,7 @@ public class ResultTagger implements Serializable {
// log.info("Remove constraints for " + communityId);
if (conf.getRemoveConstraintsMap().keySet().contains(communityId) &&
conf.getRemoveConstraintsMap().get(communityId).getCriteria() != null &&
!conf.getRemoveConstraintsMap().get(communityId).getCriteria().isEmpty() &&
conf
.getRemoveConstraintsMap()
.get(communityId)
@ -161,29 +162,30 @@ public class ResultTagger implements Serializable {
// Tagging for datasource
final Set<String> datasources = new HashSet<>();
final Set<String> collfrom = new HashSet<>();
final Set<String> cfhb = new HashSet<>();
final Set<String> hostdby = new HashSet<>();
if (Objects.nonNull(result.getInstance())) {
for (Instance i : result.getInstance()) {
if (Objects.nonNull(i.getCollectedfrom()) && Objects.nonNull(i.getCollectedfrom().getKey())) {
collfrom.add(i.getCollectedfrom().getKey());
cfhb.add(i.getCollectedfrom().getKey());
}
if (Objects.nonNull(i.getHostedby()) && Objects.nonNull(i.getHostedby().getKey())) {
cfhb.add(i.getHostedby().getKey());
hostdby.add(i.getHostedby().getKey());
}
}
collfrom
cfhb
.forEach(
dsId -> datasources
.addAll(
conf.getCommunityForDatasource(dsId, param)));
hostdby.forEach(dsId -> {
datasources
.addAll(
conf.getCommunityForDatasource(dsId, param));
// datasources
// .addAll(
// conf.getCommunityForDatasource(dsId, param));
if (conf.isEoscDatasource(dsId)) {
datasources.add("eosc");
}
@ -226,6 +228,7 @@ public class ResultTagger implements Serializable {
.forEach(communityId -> {
if (!removeCommunities.contains(communityId) &&
conf.getSelectionConstraintsMap().get(communityId).getCriteria() != null &&
!conf.getSelectionConstraintsMap().get(communityId).getCriteria().isEmpty() &&
conf
.getSelectionConstraintsMap()
.get(communityId)

View File

@ -33,6 +33,8 @@ public class SelectionConstraints implements Serializable {
// Constraints in or
public boolean verifyCriteria(final Map<String, List<String>> param) {
if (criteria.isEmpty())
return true;
for (Constraints selc : criteria) {
if (selc.verifyCriteria(param)) {
return true;

View File

@ -12,10 +12,7 @@ import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -84,10 +81,12 @@ public class SparkCountryPropagationJob {
Dataset<R> res = readPath(spark, sourcePath, resultClazz);
log.info("Reading prepared info: {}", preparedInfoPath);
Encoder<ResultCountrySet> rcsEncoder = Encoders.bean(ResultCountrySet.class);
Dataset<ResultCountrySet> prepared = spark
.read()
.schema(rcsEncoder.schema())
.json(preparedInfoPath)
.as(Encoders.bean(ResultCountrySet.class));
.as(rcsEncoder);
res
.joinWith(prepared, res.col("id").equalTo(prepared.col("resultId")), "left_outer")

View File

@ -52,6 +52,7 @@ public class PrepareResultCommunitySet {
log.info("baseURL: {}", baseURL);
final CommunityEntityMap organizationMap = Utils.getCommunityOrganization(baseURL);
// final CommunityEntityMap organizationMap = Utils.getOrganizationCommunityMap(baseURL);
log.info("organizationMap: {}", new Gson().toJson(organizationMap));
SparkConf conf = new SparkConf();

View File

@ -2,13 +2,11 @@
package eu.dnetlib.dhp.resulttocommunityfromproject;
import static eu.dnetlib.dhp.PropagationConstant.*;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkHiveSession;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.util.*;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
@ -18,16 +16,10 @@ import org.apache.spark.sql.types.StructType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.gson.Gson;
import eu.dnetlib.dhp.api.Utils;
import eu.dnetlib.dhp.api.model.CommunityEntityMap;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.resulttocommunityfromorganization.ResultCommunityList;
import eu.dnetlib.dhp.resulttocommunityfromorganization.ResultOrganizations;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.Relation;
import scala.Tuple2;
public class PrepareResultCommunitySet {
@ -55,7 +47,7 @@ public class PrepareResultCommunitySet {
final String baseURL = parser.get("baseURL");
log.info("baseURL: {}", baseURL);
final CommunityEntityMap projectsMap = Utils.getCommunityProjects(baseURL);
final CommunityEntityMap projectsMap = Utils.getProjectCommunityMap(baseURL);
SparkConf conf = new SparkConf();

View File

@ -1,11 +1,14 @@
package eu.dnetlib.dhp.resulttocommunityfromsemrel;
import static java.lang.String.join;
import static eu.dnetlib.dhp.PropagationConstant.*;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkHiveSession;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.commons.io.IOUtils;
@ -19,6 +22,7 @@ import com.google.gson.Gson;
import eu.dnetlib.dhp.api.Utils;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.resulttocommunityfromorganization.ResultCommunityList;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.Relation;
import eu.dnetlib.dhp.schema.oaf.Result;
import eu.dnetlib.dhp.utils.ISLookupClientFactory;
@ -45,7 +49,7 @@ public class PrepareResultCommunitySetStep1 {
/**
* a dataset for example could be linked to more than one publication. For each publication linked to that dataset
* the previous query will produce a row: targetId set of community context the target could possibly inherit with
* the previous query will produce a row: targetId, set of community context the target could possibly inherit. With
* the following query there will be a single row for each result linked to more than one result of the result type
* currently being used
*/
@ -56,6 +60,27 @@ public class PrepareResultCommunitySetStep1 {
+ "where length(co) > 0 "
+ "group by resultId";
private static final String RESULT_CONTEXT_QUERY_TEMPLATE_IS_RELATED_TO = "select target as resultId, community_context "
+
"from resultWithContext rwc " +
"join relatedToRelations r " +
"join patents p " +
"on rwc.id = r.source and r.target = p.id";
private static final String RESULT_WITH_CONTEXT = "select id, collect_set(co.id) community_context \n" +
" from result " +
" lateral view explode (context) c as co " +
" where lower(co.id) IN %s" +
" group by id";
private static final String RESULT_PATENT = "select id " +
" from result " +
" where array_contains(instance.instancetype.classname, 'Patent')";
private static final String IS_RELATED_TO_RELATIONS = "select source, target " +
" from relation " +
" where lower(relClass) = 'isrelatedto' and datainfo.deletedbyinference = false";
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
.toString(
@ -82,14 +107,25 @@ public class PrepareResultCommunitySetStep1 {
SparkConf conf = new SparkConf();
conf.set("hive.metastore.uris", parser.get("hive_metastore_uris"));
final List<String> allowedsemrel = Arrays.asList(parser.get("allowedsemrels").split(";"));
log.info("allowedSemRel: {}", new Gson().toJson(allowedsemrel));
final String allowedsemrel = "(" + join(
",",
Arrays
.asList(parser.get("allowedsemrels").split(";"))
.stream()
.map(value -> "'" + value.toLowerCase() + "'")
.toArray(String[]::new))
+ ")";
log.info("allowedSemRel: {}", allowedsemrel);
final String baseURL = parser.get("baseURL");
log.info("baseURL: {}", baseURL);
final List<String> communityIdList = getCommunityList(baseURL);
log.info("communityIdList: {}", new Gson().toJson(communityIdList));
final String communityIdList = "(" + join(
",", getCommunityList(baseURL)
.stream()
.map(value -> "'" + value.toLowerCase() + "'")
.toArray(String[]::new))
+ ")";
final String resultType = resultClassName.substring(resultClassName.lastIndexOf(".") + 1).toLowerCase();
log.info("resultType: {}", resultType);
@ -118,10 +154,10 @@ public class PrepareResultCommunitySetStep1 {
SparkSession spark,
String inputPath,
String outputPath,
List<String> allowedsemrel,
String allowedsemrel,
Class<R> resultClazz,
String resultType,
List<String> communityIdList) {
String communityIdList) {
final String inputResultPath = inputPath + "/" + resultType;
log.info("Reading Graph table from: {}", inputResultPath);
@ -132,7 +168,8 @@ public class PrepareResultCommunitySetStep1 {
Dataset<Relation> relation = readPath(spark, inputRelationPath, Relation.class);
relation.createOrReplaceTempView("relation");
Dataset<R> result = readPath(spark, inputResultPath, resultClazz);
Dataset<R> result = readPath(spark, inputResultPath, resultClazz)
.where("datainfo.deletedbyinference != true AND datainfo.invisible != true");
result.createOrReplaceTempView("result");
final String outputResultPath = outputPath + "/" + resultType;
@ -141,10 +178,20 @@ public class PrepareResultCommunitySetStep1 {
String resultContextQuery = String
.format(
RESULT_CONTEXT_QUERY_TEMPLATE,
getConstraintList(" lower(co.id) = '", communityIdList),
getConstraintList(" lower(relClass) = '", allowedsemrel));
"AND lower(co.id) IN " + communityIdList,
"AND lower(relClass) IN " + allowedsemrel);
Dataset<Row> result_context = spark.sql(resultContextQuery);
Dataset<Row> rwc = spark.sql(String.format(RESULT_WITH_CONTEXT, communityIdList));
Dataset<Row> patents = spark.sql(RESULT_PATENT);
Dataset<Row> relatedToRelations = spark.sql(IS_RELATED_TO_RELATIONS);
rwc.createOrReplaceTempView("resultWithContext");
patents.createOrReplaceTempView("patents");
relatedToRelations.createOrReplaceTempView("relatedTorelations");
result_context = result_context.unionAll(spark.sql(RESULT_CONTEXT_QUERY_TEMPLATE_IS_RELATED_TO));
result_context.createOrReplaceTempView("result_context");
spark
@ -152,8 +199,9 @@ public class PrepareResultCommunitySetStep1 {
.as(Encoders.bean(ResultCommunityList.class))
.write()
.option("compression", "gzip")
.mode(SaveMode.Overwrite)
.mode(SaveMode.Append)
.json(outputResultPath);
}
public static List<String> getCommunityList(final String baseURL) throws IOException {

View File

@ -4,6 +4,7 @@ package eu.dnetlib.dhp.resulttocommunityfromsemrel;
import static eu.dnetlib.dhp.PropagationConstant.*;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;
@ -76,22 +77,13 @@ public class PrepareResultCommunitySetStep2 {
if (b == null) {
return a;
}
Set<String> community_set = new HashSet<>();
a.getCommunityList().stream().forEach(aa -> community_set.add(aa));
b
.getCommunityList()
.stream()
.forEach(
aa -> {
if (!community_set.contains(aa)) {
a.getCommunityList().add(aa);
community_set.add(aa);
}
});
Set<String> community_set = new HashSet<>(a.getCommunityList());
community_set.addAll(b.getCommunityList());
a.setCommunityList(new ArrayList<>(community_set));
return a;
})
.map(Tuple2::_2)
.map(r -> OBJECT_MAPPER.writeValueAsString(r))
.map(OBJECT_MAPPER::writeValueAsString)
.saveAsTextFile(outputPath, GzipCodec.class);
}

View File

@ -28,4 +28,7 @@ blacklist=empty
allowedpids=orcid;orcid_pending
baseURL = https://services.openaire.eu/openaire/community/
iterations=1
dbUrl=jdbc:postgresql://beta.services.openaire.eu:5432/dnet_openaireplus
dbUser=dnet
dbPassword=dnetPwd

View File

@ -170,6 +170,18 @@
<name>pathMap</name>
<value>${pathMap}</value>
</property>
<property>
<name>dbUrl</name>
<value>${dbUrl}</value>
</property>
<property>
<name>dbUser</name>
<value>${dbUser}</value>
</property>
<property>
<name>dbPassword</name>
<value>${dbPassword}</value>
</property>
</configuration>
</sub-workflow>
<ok to="affiliation_inst_repo" />

View File

@ -39,5 +39,22 @@
"paramLongName": "nameNode",
"paramDescription": "this parameter is to specify the api to be queried (beta or production)",
"paramRequired": true
},
{
"paramName": "du",
"paramLongName": "dbUrl",
"paramDescription": "this parameter is to specify the api to be queried (beta or production)",
"paramRequired": true
},{
"paramName": "dus",
"paramLongName": "dbUser",
"paramDescription": "this parameter is to specify the api to be queried (beta or production)",
"paramRequired": true
},{
"paramName": "dp",
"paramLongName": "dbPassword",
"paramDescription": "this parameter is to specify the api to be queried (beta or production)",
"paramRequired": true
}
]

View File

@ -16,6 +16,18 @@
<name>startFrom></name>
<value>undelete</value>
</property>
<property>
<name>dbUrl</name>
</property>
<property>
<name>dbUser</name>
</property>
<property>
<name>dbPassword</name>
</property>
</parameters>
@ -77,6 +89,9 @@
<arg>--pathMap</arg><arg>${pathMap}</arg>
<arg>--baseURL</arg><arg>${baseURL}</arg>
<arg>--nameNode</arg><arg>${nameNode}</arg>
<arg>--dbUrl</arg><arg>${dbUrl}</arg>
<arg>--dbUser</arg><arg>${dbUser}</arg>
<arg>--dbPassword</arg><arg>${dbPassword}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>

View File

@ -6,7 +6,6 @@ import static eu.dnetlib.dhp.bulktag.community.TaggingConstants.ZENODO_COMMUNITY
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.io.FileUtils;
@ -18,7 +17,6 @@ import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
@ -29,9 +27,13 @@ import org.junit.jupiter.api.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.gson.Gson;
import eu.dnetlib.dhp.api.Utils;
import eu.dnetlib.dhp.api.model.SubCommunityModel;
import eu.dnetlib.dhp.bulktag.community.CommunityConfiguration;
import eu.dnetlib.dhp.bulktag.community.ProtoMap;
import eu.dnetlib.dhp.schema.oaf.*;
@ -1949,4 +1951,21 @@ public class BulkTagJobTest {
}
@Test
public void testApi() throws IOException {
String baseURL = "https://dev-openaire.d4science.org/openaire/community/";
List<SubCommunityModel> subcommunities = Utils.getSubcommunities("clarin", baseURL);
CommunityConfiguration tmp = Utils.getCommunityConfiguration(baseURL);
tmp.getCommunities().keySet().forEach(c -> {
try {
System.out.println(new ObjectMapper().writeValueAsString(tmp.getCommunities().get(c)));
} catch (JsonProcessingException e) {
throw new RuntimeException(e);
}
});
System.out.println(new ObjectMapper().writeValueAsString(Utils.getOrganizationCommunityMap(baseURL)));
}
}

View File

@ -6,7 +6,9 @@ import static org.apache.spark.sql.functions.desc;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.commons.io.FileUtils;
import org.apache.spark.SparkConf;
@ -24,7 +26,9 @@ import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.resulttocommunityfromorganization.ResultCommunityList;
import eu.dnetlib.dhp.schema.oaf.Dataset;
import scala.collection.Seq;
public class ResultToCommunityJobTest {
@ -271,4 +275,59 @@ public class ResultToCommunityJobTest {
.get(0)
.getString(0));
}
@Test
public void prepareStep1Test() throws Exception {
/*
* final String allowedsemrel = join(",", Arrays.stream(parser.get("allowedsemrels").split(";")) .map(value ->
* "'" + value.toLowerCase() + "'") .toArray(String[]::new)); log.info("allowedSemRel: {}", new
* Gson().toJson(allowedsemrel)); final String baseURL = parser.get("baseURL"); log.info("baseURL: {}",
* baseURL);
*/
PrepareResultCommunitySetStep1
.main(
new String[] {
"-isSparkSessionManaged", Boolean.FALSE.toString(),
"-sourcePath", getClass()
.getResource("/eu/dnetlib/dhp/resulttocommunityfromsemrel/graph")
.getPath(),
"-hive_metastore_uris", "",
"-resultTableName", "eu.dnetlib.dhp.schema.oaf.Publication",
"-outputPath", workingDir.toString() + "/preparedInfo",
"-allowedsemrels", "issupplementto;issupplementedby",
"-baseURL", "https://dev-openaire.d4science.org/openaire/community/"
});
org.apache.spark.sql.Dataset<ResultCommunityList> resultCommunityList = spark
.read()
.schema(Encoders.bean(ResultCommunityList.class).schema())
.json(workingDir.toString() + "/preparedInfo/publication")
.as(Encoders.bean(ResultCommunityList.class));
Assertions.assertEquals(2, resultCommunityList.count());
Assertions
.assertEquals(
1,
resultCommunityList.filter("resultId = '50|dedup_wf_001::06e51d2bf295531b2d2e7a1b55500783'").count());
Assertions
.assertEquals(
1,
resultCommunityList.filter("resultId = '50|pending_org_::82f63b2d21ae88596b9d8991780e9888'").count());
ArrayList<String> communities = resultCommunityList
.filter("resultId = '50|dedup_wf_001::06e51d2bf295531b2d2e7a1b55500783'")
.first()
.getCommunityList();
Assertions.assertEquals(2, communities.size());
Assertions.assertTrue(communities.stream().anyMatch(cid -> "beopen".equals(cid)));
Assertions.assertTrue(communities.stream().anyMatch(cid -> "dh-ch".equals(cid)));
communities = resultCommunityList
.filter("resultId = '50|pending_org_::82f63b2d21ae88596b9d8991780e9888'")
.first()
.getCommunityList();
Assertions.assertEquals(1, communities.size());
Assertions.assertEquals("dh-ch", communities.get(0));
}
}

View File

@ -0,0 +1,24 @@
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"issupplementedby","relType":"resultOrganization","source":"50|355e65625b88::e7d48a470b13bda61f7ebe3513e20cb6","subRelType":"affiliation","target":"50|pending_org_::82f63b2d21ae88596b9d8991780e9888","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"issupplementedby","relType":"resultOrganization","source":"50|355e65625b88::e7d48a470b13bda61f7ebe3513e20cb6","subRelType":"affiliation","target":"50|dedup_wf_001::06e51d2bf295531b2d2e7a1b55500783","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"IsProvidedBy","relType":"resultOrganization","source":"10|opendoar____::f0dd4a99fba6075a9494772b58f95280","subRelType":"affiliation","target":"20|openorgs____::322ff2a6524820640bc5d1311871585e","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"IsProvidedBy","relType":"resultOrganization","source":"10|eurocrisdris::9ae43d14471c4b33661fedda6f06b539","subRelType":"affiliation","target":"20|openorgs____::58e60f1715d219aa6757ba0b0f2ccbce","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"IsProvidedBy","relType":"resultOrganization","target":"20|openorgs____::64badd35233ba2cd4946368ef2f4cf57","subRelType":"affiliation","source":"10|issn___print::a7a2010e75d849442790955162ef4e42","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"IsProvidedBy","relType":"resultOrganization","source":"10|issn___print::a7a2010e75d849442790955162ef4e43","subRelType":"affiliation","target":"20|openorgs____::64badd35233ba2cd4946368ef2f4cf57","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"IsProvidedBy","relType":"resultOrganization","source":"10|issn___print::a7a2010e75d849442790955162ef4e44","subRelType":"affiliation","target":"20|openorgs____::548cbb0c5a93722f3a9aa62aa17a1ba1","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"IsProvidedBy","relType":"resultOrganization","source":"10|issn___print::a7a2010e75d849442790955162ef4e45","subRelType":"affiliation","target":"20|pending_org_::c522a7c935f9fd9578122e60eeec282c","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"isrelatedto","relType":"resultOrganization","source":"50|openorgs____::64badd35233ba2cd4946368ef2f4cf57","subRelType":"affiliation","target":"50|dedup_wf_001::06e51d2bf295531b2d2e7a1b55500783","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"hasAuthorInstitution","relType":"resultOrganization","source":"50|dedup_wf_001::06e51d2bf295531b2d2e7a1b55500783","subRelType":"affiliation","target":"20|openorgs____::64badd35233ba2cd4946368ef2f4cf57","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"isrelatedto","relType":"resultOrganization","source":"50|355e65625b88::74009c567c81b4aa55c813db658734df","subRelType":"affiliation","target":"50|dedup_wf_001::08d6f2001319c86d0e69b0f83ad75df2","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"hasAuthorInstitution","relType":"resultOrganization","source":"50|dedup_wf_001::08d6f2001319c86d0e69b0f83ad75df2","subRelType":"affiliation","target":"20|openorgs____::91a81877815afb4ebf25c1a3f3b03c5d","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"isAuthorInstitutionOf","relType":"resultOrganization","source":"20|openorgs____::548cbb0c5a93722f3a9aa62aa17a1ba1","subRelType":"affiliation","target":"50|dedup_wf_001::0a1cdf269375d32ce341fdeb0e92dfa8","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"hasAuthorInstitution","relType":"resultOrganization","source":"50|dedup_wf_001::0a1cdf269375d32ce341fdeb0e92dfa8","subRelType":"affiliation","target":"20|openorgs____::548cbb0c5a93722f3a9aa62aa17a1ba1","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"isAuthorInstitutionOf","relType":"resultOrganization","source":"20|pending_org_::a50fdd7f7e77b74ea2b16823151c391a","subRelType":"affiliation","target":"50|dedup_wf_001::0ab92bed024ee6883c7a1244722e5eec","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"hasAuthorInstitution","relType":"resultOrganization","source":"50|dedup_wf_001::0ab92bed024ee6883c7a1244722e5eec","subRelType":"affiliation","target":"20|pending_org_::a50fdd7f7e77b74ea2b16823151c391a","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"isAuthorInstitutionOf","relType":"resultOrganization","source":"20|openorgs____::64badd35233ba2cd4946368ef2f4cf57","subRelType":"affiliation","target":"50|dedup_wf_001::0ca26c736ad4d15b3d5ee90a4d7853e1","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"hasAuthorInstitution","relType":"resultOrganization","source":"50|dedup_wf_001::0ca26c736ad4d15b3d5ee90a4d7853e1","subRelType":"affiliation","target":"20|openorgs____::64badd35233ba2cd4946368ef2f4cf57","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"isAuthorInstitutionOf","relType":"resultOrganization","source":"20|pending_org_::a50fdd7f7e77b74ea2b16823151c391a","subRelType":"affiliation","target":"50|dedup_wf_001::0ef8dfab3927cb4d69df0d3113f05a42","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"hasAuthorInstitution","relType":"resultOrganization","source":"50|dedup_wf_001::0ef8dfab3927cb4d69df0d3113f05a42","subRelType":"affiliation","target":"20|pending_org_::a50fdd7f7e77b74ea2b16823151c391a","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"isAuthorInstitutionOf","relType":"resultOrganization","source":"20|openorgs____::548cbb0c5a93722f3a9aa62aa17a1ba1","subRelType":"affiliation","target":"50|dedup_wf_001::0f488ad00253126c14a21abe6b2d406c","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"hasAuthorInstitution","relType":"resultOrganization","source":"50|dedup_wf_001::0f488ad00253126c14a21abe6b2d406c","subRelType":"affiliation","target":"20|openorgs____::548cbb0c5a93722f3a9aa62aa17a1ba1","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"isAuthorInstitutionOf","relType":"resultOrganization","source":"20|pending_org_::c522a7c935f9fd9578122e60eeec282c","subRelType":"affiliation","target":"50|dedup_wf_001::12206bf78aabd7d52132477182d19147","validated":false}
{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"propagation","inferred":true,"invisible":false,"provenanceaction":{"classid":"result:organization:instrepo","classname":"Propagation of affiliation to result collected from datasources of type institutional repository","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":"0.85"},"properties":[],"relClass":"hasAuthorInstitution","relType":"resultOrganization","source":"50|dedup_wf_001::12206bf78aabd7d52132477182d19147","subRelType":"affiliation","target":"20|pending_org_::c522a7c935f9fd9578122e60eeec282c","validated":false}

View File

@ -0,0 +1,80 @@
package eu.dnetlib.dhp.oa.graph.hive;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkHiveSession;
import java.util.Optional;
import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.schema.oaf.Oaf;
public class GraphHiveTableExporterJob {
private static final Logger log = LoggerFactory.getLogger(GraphHiveTableExporterJob.class);
public static void main(String[] args) throws Exception {
final ArgumentApplicationParser parser = new ArgumentApplicationParser(
IOUtils
.toString(
GraphHiveTableExporterJob.class
.getResourceAsStream(
"/eu/dnetlib/dhp/oa/graph/hive_db_exporter_parameters.json")));
parser.parseArgument(args);
Boolean isSparkSessionManaged = Optional
.ofNullable(parser.get("isSparkSessionManaged"))
.map(Boolean::valueOf)
.orElse(Boolean.TRUE);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
int numPartitions = Optional
.ofNullable(parser.get("numPartitions"))
.map(Integer::valueOf)
.orElse(-1);
log.info("numPartitions: {}", numPartitions);
String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath);
String hiveTableName = parser.get("hiveTableName");
log.info("hiveTableName: {}", hiveTableName);
String hiveMetastoreUris = parser.get("hiveMetastoreUris");
log.info("hiveMetastoreUris: {}", hiveMetastoreUris);
String mode = parser.get("mode");
log.info("mode: {}", mode);
SparkConf conf = new SparkConf();
conf.set("hive.metastore.uris", hiveMetastoreUris);
runWithSparkHiveSession(
conf, isSparkSessionManaged,
spark -> saveGraphTable(spark, outputPath, hiveTableName, mode, numPartitions));
}
// protected for testing
private static <T extends Oaf> void saveGraphTable(SparkSession spark, String outputPath, String hiveTableName,
String mode, int numPartitions) {
Dataset<Row> dataset = spark.table(hiveTableName);
if (numPartitions > 0) {
log.info("repartitioning to {} partitions", numPartitions);
dataset = dataset.repartition(numPartitions);
}
dataset
.write()
.mode(mode)
.option("compression", "gzip")
.json(outputPath);
}
}

View File

@ -153,34 +153,40 @@ public abstract class AbstractMdRecordToOafMapper {
final DataInfo entityInfo = prepareDataInfo(doc, this.invisible);
final long lastUpdateTimestamp = new Date().getTime();
final List<Instance> instances = prepareInstances(doc, entityInfo, collectedFrom, hostedBy);
final Instance instance = prepareInstances(doc, entityInfo, collectedFrom, hostedBy);
final String type = getResultType(doc, instances);
if (!Optional
.ofNullable(instance.getInstancetype())
.map(Qualifier::getClassid)
.filter(StringUtils::isNotBlank)
.isPresent()) {
return Lists.newArrayList();
}
return createOafs(doc, type, instances, collectedFrom, entityInfo, lastUpdateTimestamp);
final String type = getResultType(instance);
return createOafs(doc, type, instance, collectedFrom, entityInfo, lastUpdateTimestamp);
} catch (final DocumentException e) {
log.error("Error with record:\n" + xml);
return Lists.newArrayList();
}
}
protected String getResultType(final Document doc, final List<Instance> instances) {
final String type = doc.valueOf("//dr:CobjCategory/@type");
if (StringUtils.isBlank(type) && this.vocs.vocabularyExists(ModelConstants.DNET_RESULT_TYPOLOGIES)) {
final String instanceType = instances
.stream()
.map(i -> i.getInstancetype().getClassid())
.findFirst()
.filter(s -> !UNKNOWN.equalsIgnoreCase(s))
.orElse("0000"); // Unknown
protected String getResultType(final Instance instance) {
if (this.vocs.vocabularyExists(ModelConstants.DNET_RESULT_TYPOLOGIES)) {
return Optional
.ofNullable(this.vocs.getSynonymAsQualifier(ModelConstants.DNET_RESULT_TYPOLOGIES, instanceType))
.ofNullable(instance.getInstancetype())
.map(Qualifier::getClassid)
.map(
instanceType -> Optional
.ofNullable(
this.vocs.getSynonymAsQualifier(ModelConstants.DNET_RESULT_TYPOLOGIES, instanceType))
.map(Qualifier::getClassid)
.orElse("0000"))
.orElse("0000");
} else {
throw new IllegalStateException("Missing vocabulary: " + ModelConstants.DNET_RESULT_TYPOLOGIES);
}
return type;
}
private KeyValue getProvenanceDatasource(final Document doc, final String xpathId, final String xpathName) {
@ -197,12 +203,12 @@ public abstract class AbstractMdRecordToOafMapper {
protected List<Oaf> createOafs(
final Document doc,
final String type,
final List<Instance> instances,
final Instance instance,
final KeyValue collectedFrom,
final DataInfo info,
final long lastUpdateTimestamp) {
final OafEntity entity = createEntity(doc, type, instances, collectedFrom, info, lastUpdateTimestamp);
final OafEntity entity = createEntity(doc, type, instance, collectedFrom, info, lastUpdateTimestamp);
final Set<String> originalId = Sets.newHashSet(entity.getOriginalId());
originalId.add(entity.getId());
@ -235,19 +241,19 @@ public abstract class AbstractMdRecordToOafMapper {
private OafEntity createEntity(final Document doc,
final String type,
final List<Instance> instances,
final Instance instance,
final KeyValue collectedFrom,
final DataInfo info,
final long lastUpdateTimestamp) {
switch (type.toLowerCase()) {
case "publication":
final Publication p = new Publication();
populateResultFields(p, doc, instances, collectedFrom, info, lastUpdateTimestamp);
populateResultFields(p, doc, instance, collectedFrom, info, lastUpdateTimestamp);
p.setJournal(prepareJournal(doc, info));
return p;
case "dataset":
final Dataset d = new Dataset();
populateResultFields(d, doc, instances, collectedFrom, info, lastUpdateTimestamp);
populateResultFields(d, doc, instance, collectedFrom, info, lastUpdateTimestamp);
d.setStoragedate(prepareDatasetStorageDate(doc, info));
d.setDevice(prepareDatasetDevice(doc, info));
d.setSize(prepareDatasetSize(doc, info));
@ -258,7 +264,7 @@ public abstract class AbstractMdRecordToOafMapper {
return d;
case "software":
final Software s = new Software();
populateResultFields(s, doc, instances, collectedFrom, info, lastUpdateTimestamp);
populateResultFields(s, doc, instance, collectedFrom, info, lastUpdateTimestamp);
s.setDocumentationUrl(prepareSoftwareDocumentationUrls(doc, info));
s.setLicense(prepareSoftwareLicenses(doc, info));
s.setCodeRepositoryUrl(prepareSoftwareCodeRepositoryUrl(doc, info));
@ -268,7 +274,7 @@ public abstract class AbstractMdRecordToOafMapper {
case "otherresearchproducts":
default:
final OtherResearchProduct o = new OtherResearchProduct();
populateResultFields(o, doc, instances, collectedFrom, info, lastUpdateTimestamp);
populateResultFields(o, doc, instance, collectedFrom, info, lastUpdateTimestamp);
o.setContactperson(prepareOtherResearchProductContactPersons(doc, info));
o.setContactgroup(prepareOtherResearchProductContactGroups(doc, info));
o.setTool(prepareOtherResearchProductTools(doc, info));
@ -415,7 +421,7 @@ public abstract class AbstractMdRecordToOafMapper {
private void populateResultFields(
final Result r,
final Document doc,
final List<Instance> instances,
final Instance instance,
final KeyValue collectedFrom,
final DataInfo info,
final long lastUpdateTimestamp) {
@ -449,8 +455,8 @@ public abstract class AbstractMdRecordToOafMapper {
r.setExternalReference(new ArrayList<>()); // NOT PRESENT IN MDSTORES
r.setProcessingchargeamount(field(doc.valueOf("//oaf:processingchargeamount"), info));
r.setProcessingchargecurrency(field(doc.valueOf("//oaf:processingchargeamount/@currency"), info));
r.setInstance(instances);
r.setBestaccessright(OafMapperUtils.createBestAccessRights(instances));
r.setInstance(Arrays.asList(instance));
r.setBestaccessright(OafMapperUtils.createBestAccessRights(Arrays.asList(instance)));
r.setEoscifguidelines(prepareEOSCIfGuidelines(doc, info));
}
@ -509,7 +515,7 @@ public abstract class AbstractMdRecordToOafMapper {
protected abstract Qualifier prepareResourceType(Document doc, DataInfo info);
protected abstract List<Instance> prepareInstances(
protected abstract Instance prepareInstances(
Document doc,
DataInfo info,
KeyValue collectedfrom,

View File

@ -133,7 +133,7 @@ public class GenerateEntitiesApplication extends AbstractMigrationApplication {
inputRdd
.keyBy(oaf -> ModelSupport.idFn().apply(oaf))
.groupByKey()
.map(t -> MergeUtils.mergeGroup(t._1, t._2.iterator())),
.map(t -> MergeUtils.mergeGroup(t._2.iterator())),
// .mapToPair(oaf -> new Tuple2<>(ModelSupport.idFn().apply(oaf), oaf))
// .reduceByKey(MergeUtils::merge)
// .map(Tuple2::_2),

View File

@ -135,7 +135,7 @@ public class OafToOafMapper extends AbstractMdRecordToOafMapper {
}
@Override
protected List<Instance> prepareInstances(
protected Instance prepareInstances(
final Document doc,
final DataInfo info,
final KeyValue collectedfrom,
@ -197,7 +197,7 @@ public class OafToOafMapper extends AbstractMdRecordToOafMapper {
instance.getUrl().addAll(validUrl);
}
return Lists.newArrayList(instance);
return instance;
}
/**

View File

@ -126,7 +126,7 @@ public class OdfToOafMapper extends AbstractMdRecordToOafMapper {
}
@Override
protected List<Instance> prepareInstances(
protected Instance prepareInstances(
final Document doc,
final DataInfo info,
final KeyValue collectedfrom,
@ -210,7 +210,7 @@ public class OdfToOafMapper extends AbstractMdRecordToOafMapper {
instance.setUrl(new ArrayList<>());
instance.getUrl().addAll(validUrl);
}
return Arrays.asList(instance);
return instance;
}
protected String trimAndDecodeUrl(String url) {
@ -319,7 +319,7 @@ public class OdfToOafMapper extends AbstractMdRecordToOafMapper {
@Override
protected List<Field<String>> prepareDescriptions(final Document doc, final DataInfo info) {
return prepareListFields(doc, "//*[local-name()='description' and ./@descriptionType='Abstract']", info);
return prepareListFields(doc, "//datacite:description[./@descriptionType='Abstract'] | //dc:description", info);
}
@Override

View File

@ -80,9 +80,6 @@ public class PatchRelationsApplication {
final Dataset<Relation> rels = readPath(spark, relationPath, Relation.class);
final Dataset<RelationIdMapping> idMapping = readPath(spark, idMappingPath, RelationIdMapping.class);
log.info("relations: {}", rels.count());
log.info("idMapping: {}", idMapping.count());
final Dataset<Relation> bySource = rels
.joinWith(idMapping, rels.col("source").equalTo(idMapping.col("oldId")), "left")
.map((MapFunction<Tuple2<Relation, RelationIdMapping>, Relation>) t -> {

View File

@ -51,6 +51,7 @@
<arg>--orcidPath</arg><arg>${orcidPath}</arg>
<arg>--targetPath</arg><arg>${targetPath}</arg>
<arg>--graphPath</arg><arg>${graphPath}</arg>
<arg>--workingDir</arg><arg>${workingDir}</arg>
<arg>--master</arg><arg>yarn</arg>
</spark>
<ok to="reset_outputpath"/>

View File

@ -162,6 +162,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.sql.shuffle.partitions=15000
</spark-opts>
<arg>--inputPath</arg><arg>${graphInputPath}/publication</arg>
@ -197,6 +198,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.sql.shuffle.partitions=8000
</spark-opts>
<arg>--inputPath</arg><arg>${graphInputPath}/dataset</arg>
@ -232,6 +234,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.sql.shuffle.partitions=5000
</spark-opts>
<arg>--inputPath</arg><arg>${graphInputPath}/otherresearchproduct</arg>
@ -267,6 +270,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.sql.shuffle.partitions=2000
</spark-opts>
<arg>--inputPath</arg><arg>${graphInputPath}/software</arg>
@ -302,6 +306,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.sql.shuffle.partitions=1000
</spark-opts>
<arg>--inputPath</arg><arg>${graphInputPath}/datasource</arg>
@ -337,6 +342,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.sql.shuffle.partitions=1000
</spark-opts>
<arg>--inputPath</arg><arg>${graphInputPath}/organization</arg>
@ -372,6 +378,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.sql.shuffle.partitions=2000
</spark-opts>
<arg>--inputPath</arg><arg>${graphInputPath}/project</arg>
@ -407,6 +414,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.sql.shuffle.partitions=2000
</spark-opts>
<arg>--inputPath</arg><arg>${graphInputPath}/person</arg>
@ -442,6 +450,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.sql.shuffle.partitions=20000
</spark-opts>
<arg>--inputPath</arg><arg>${graphInputPath}/relation</arg>

View File

@ -0,0 +1,32 @@
[
{
"paramName": "issm",
"paramLongName": "isSparkSessionManaged",
"paramDescription": "when true will stop SparkSession after job execution",
"paramRequired": false
},
{
"paramName": "out",
"paramLongName": "outputPath",
"paramDescription": "the path to the graph data dump to read",
"paramRequired": true
},
{
"paramName": "mode",
"paramLongName": "mode",
"paramDescription": "mode (append|overwrite)",
"paramRequired": true
},
{
"paramName": "hmu",
"paramLongName": "hiveMetastoreUris",
"paramDescription": "the hive metastore uris",
"paramRequired": true
},
{
"paramName": "db",
"paramLongName": "hiveTableName",
"paramDescription": "the input hive table identifier",
"paramRequired": true
}
]

View File

@ -68,6 +68,7 @@
<path start="merge_datasource"/>
<path start="merge_organization"/>
<path start="merge_project"/>
<path start="merge_person"/>
<path start="merge_relation"/>
</fork>
@ -260,6 +261,33 @@
<error to="Kill"/>
</action>
<action name="merge_person">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Merge person</name>
<class>eu.dnetlib.dhp.oa.graph.merge.MergeGraphTableSparkJob</class>
<jar>dhp-graph-mapper-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--betaInputPath</arg><arg>${betaInputGraphPath}/person</arg>
<arg>--prodInputPath</arg><arg>${prodInputGraphPath}/person</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/person</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Person</arg>
<arg>--priority</arg><arg>${priority}</arg>
</spark>
<ok to="wait_merge"/>
<error to="Kill"/>
</action>
<action name="merge_relation">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>

View File

@ -649,6 +649,7 @@
<path start="merge_claims_datasource"/>
<path start="merge_claims_organization"/>
<path start="merge_claims_project"/>
<path start="merge_claims_person"/>
<path start="merge_claims_relation"/>
</fork>
@ -860,6 +861,32 @@
<error to="Kill"/>
</action>
<action name="merge_claims_person">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>MergeClaims_person</name>
<class>eu.dnetlib.dhp.oa.graph.raw.MergeClaimsApplication</class>
<jar>dhp-graph-mapper-${projectVersion}.jar</jar>
<spark-opts>
--executor-memory ${sparkExecutorMemory}
--executor-cores ${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=200
</spark-opts>
<arg>--rawGraphPath</arg><arg>${workingDir}/graph_raw</arg>
<arg>--claimsGraphPath</arg><arg>${workingDir}/graph_claims</arg>
<arg>--outputRawGaphPath</arg><arg>${graphOutputPath}</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Person</arg>
</spark>
<ok to="wait_merge"/>
<error to="Kill"/>
</action>
<join name="wait_merge" to="decisionPatchRelations"/>
<decision name="decisionPatchRelations">

View File

@ -1,12 +1,10 @@
package eu.dnetlib.dhp.oa.graph.raw
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import eu.dnetlib.dhp.application.ArgumentApplicationParser
import eu.dnetlib.dhp.common.HdfsSupport
import eu.dnetlib.dhp.schema.common.ModelSupport
import eu.dnetlib.dhp.schema.oaf.Oaf
import eu.dnetlib.dhp.utils.DHPUtils
import org.apache.spark.sql.{Encoder, Encoders, SaveMode, SparkSession}
import org.apache.spark.sql.{Encoders, SaveMode, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse
@ -54,48 +52,60 @@ object CopyHdfsOafSparkApplication {
val hdfsPath = parser.get("hdfsPath")
log.info("hdfsPath: {}", hdfsPath)
implicit val oafEncoder: Encoder[Oaf] = Encoders.kryo[Oaf]
val paths =
DHPUtils.mdstorePaths(mdstoreManagerUrl, mdFormat, mdLayout, mdInterpretation, true).asScala
val validPaths: List[String] =
paths.filter(p => HdfsSupport.exists(p, sc.hadoopConfiguration)).toList
val types = ModelSupport.oafTypes.entrySet.asScala
.map(e => Tuple2(e.getKey, e.getValue))
if (validPaths.nonEmpty) {
val oaf = spark.read.textFile(validPaths: _*)
val mapper =
new ObjectMapper().configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
val oaf = spark.read
.textFile(validPaths: _*)
.map(v => (getOafType(v), v))(Encoders.tuple(Encoders.STRING, Encoders.STRING))
.cache()
types.foreach(t =>
oaf
.filter(o => isOafType(o, t._1))
.map(j => mapper.readValue(j, t._2).asInstanceOf[Oaf])
.map(s => mapper.writeValueAsString(s))(Encoders.STRING)
.write
.option("compression", "gzip")
.mode(SaveMode.Append)
.text(s"$hdfsPath/${t._1}")
)
try {
ModelSupport.oafTypes
.keySet()
.asScala
.foreach(entity =>
oaf
.filter(s"_1 = '${entity}'")
.selectExpr("_2")
.write
.option("compression", "gzip")
.mode(SaveMode.Append)
.text(s"$hdfsPath/${entity}")
)
} finally {
oaf.unpersist()
}
}
}
def isOafType(input: String, oafType: String): Boolean = {
def getOafType(input: String): String = {
implicit lazy val formats: DefaultFormats.type = org.json4s.DefaultFormats
lazy val json: org.json4s.JValue = parse(input)
if (oafType == "relation") {
val hasSource = (json \ "source").extractOrElse[String](null)
val hasTarget = (json \ "target").extractOrElse[String](null)
hasSource != null && hasTarget != null
val hasId = (json \ "id").extractOrElse[String](null)
val hasSource = (json \ "source").extractOrElse[String](null)
val hasTarget = (json \ "target").extractOrElse[String](null)
if (hasId == null && hasSource != null && hasTarget != null) {
"relation"
} else if (hasId != null) {
val oafType: String = ModelSupport.idPrefixEntity.get(hasId.substring(0, 2))
oafType match {
case "result" =>
(json \ "resulttype" \ "classid").extractOrElse[String](null) match {
case "other" => "otherresearchproduct"
case any => any
}
case _ => oafType
}
} else {
val hasId = (json \ "id").extractOrElse[String](null)
val resultType = (json \ "resulttype" \ "classid").extractOrElse[String]("")
hasId != null && oafType.startsWith(resultType)
null
}
}
}

View File

@ -133,7 +133,7 @@ object SparkCreateInputGraph {
val ds: Dataset[T] = spark.read.load(sourcePath).as[T]
ds.groupByKey(_.getId)
.mapGroups { (id, it) => MergeUtils.mergeGroup(id, it.asJava).asInstanceOf[T] }
.mapGroups { (id, it) => MergeUtils.mergeGroup(it.asJava).asInstanceOf[T] }
// .reduceGroups { (x: T, y: T) => MergeUtils.merge(x, y).asInstanceOf[T] }
// .map(_)
.write

View File

@ -1,8 +1,8 @@
package eu.dnetlib.dhp.oa.graph.raw;
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;
import static eu.dnetlib.dhp.oa.graph.raw.CopyHdfsOafSparkApplication.getOafType;
import static org.junit.jupiter.api.Assertions.assertEquals;
import java.io.IOException;
@ -11,67 +11,24 @@ import org.junit.jupiter.api.Test;
public class CopyHdfsOafSparkApplicationTest {
String getResourceAsStream(String path) throws IOException {
return IOUtils.toString(getClass().getResourceAsStream(path));
}
@Test
void testIsOafType() throws IOException {
assertTrue(
CopyHdfsOafSparkApplication
.isOafType(
IOUtils
.toString(
getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/graph/raw/publication_1.json")),
"publication"));
assertTrue(
CopyHdfsOafSparkApplication
.isOafType(
IOUtils
.toString(
getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/graph/raw/dataset_1.json")),
"dataset"));
assertTrue(
CopyHdfsOafSparkApplication
.isOafType(
IOUtils
.toString(
getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/graph/raw/relation_1.json")),
"relation"));
assertFalse(
CopyHdfsOafSparkApplication
.isOafType(
IOUtils
.toString(
getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/graph/raw/publication_1.json")),
"dataset"));
assertFalse(
CopyHdfsOafSparkApplication
.isOafType(
IOUtils
.toString(
getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/graph/raw/dataset_1.json")),
"publication"));
assertTrue(
CopyHdfsOafSparkApplication
.isOafType(
IOUtils
.toString(
getClass()
.getResourceAsStream(
"/eu/dnetlib/dhp/oa/graph/raw/publication_2_unknownProperty.json")),
"publication"));
assertEquals("publication", getOafType(getResourceAsStream("/eu/dnetlib/dhp/oa/graph/raw/publication_1.json")));
assertEquals("dataset", getOafType(getResourceAsStream("/eu/dnetlib/dhp/oa/graph/raw/dataset_1.json")));
assertEquals("relation", getOafType(getResourceAsStream("/eu/dnetlib/dhp/oa/graph/raw/relation_1.json")));
assertEquals("publication", getOafType(getResourceAsStream("/eu/dnetlib/dhp/oa/graph/raw/publication_1.json")));
assertEquals(
"publication",
getOafType(getResourceAsStream("/eu/dnetlib/dhp/oa/graph/raw/publication_2_unknownProperty.json")));
}
@Test
void isOafType_Datacite_ORP() throws IOException {
assertTrue(
CopyHdfsOafSparkApplication
.isOafType(
IOUtils
.toString(
getClass()
.getResourceAsStream(
"/eu/dnetlib/dhp/oa/graph/raw/datacite_orp.json")),
"otherresearchproduct"));
assertEquals(
"otherresearchproduct", getOafType(getResourceAsStream("/eu/dnetlib/dhp/oa/graph/raw/datacite_orp.json")));
}
}

View File

@ -906,6 +906,30 @@ class MappersTest {
assertEquals("IT", p.getCountry().get(0).getClassid());
assertEquals("FR", p.getCountry().get(1).getClassid());
assertEquals("DE", p.getCountry().get(2).getClassid());
assertNotNull(p.getDescription());
assertEquals(1, p.getDescription().size());
assertNotNull(p.getDescription().get(0));
assertTrue(StringUtils.isNotBlank(p.getDescription().get(0).getValue()));
}
@Test
void testODFRecord_guidelines4() throws IOException {
final String xml = IOUtils
.toString(Objects.requireNonNull(getClass().getResourceAsStream("odf_guidelines4.xml")));
final List<Oaf> list = new OdfToOafMapper(vocs, false, true).processMdRecord(xml);
final Publication p = (Publication) list.get(0);
assertValidId(p.getId());
assertValidId(p.getCollectedfrom().get(0).getKey());
assertTrue(StringUtils.isNotBlank(p.getTitle().get(0).getValue()));
assertNotNull(p.getDescription());
assertEquals(2, p.getDescription().size());
assertNotNull(p.getDescription().get(0));
assertTrue(StringUtils.isNotBlank(p.getDescription().get(0).getValue()));
assertNotNull(p.getDescription().get(1));
assertTrue(StringUtils.isNotBlank(p.getDescription().get(1).getValue()));
}
@Test

Some files were not shown because too many files have changed in this diff Show More