additional XSLT transformation scripts, enhance methods #97
Closed
andreas.czerniak wants to merge 49 commits from hadoop_aggregator into hadoop_aggregator
Reference: D-Net/dnet-hadoop#97
No description provided.
Before integrating the PR I'd like to ask about the background that motivated the introduction of yet another class implementing the vocabulary-based cleaning. In fact, the same operations performed by `TransformationFunctionProxy` seem to be already available in Cleaner.java.

If your intention is to reuse the `xslt_cleaning_datarepo_datacite.xsl` transformation rule as it is currently defined in the production system, I am afraid it won't be possible without introducing a few adjustments. We should therefore probably prepare for a transition phase in which the transformation rules/scripts are duplicated and progressively migrated towards the updated definitions. We could agree to reuse as much as possible from the current definitions, perhaps maintaining the same extension function specification, but I don't see how much we would benefit in the end, as the updated XSLT engine assumes a different syntax for the invocation of the extension functions. In fact, the current transformation scripts declare a `tr` variable and then pass it to the `convertString` function, while in the new implementation you don't need to instantiate the `tr` variable and, most importantly, you cannot pass it to the `convertString` function.

That said, the transformation scripts will need to be revised anyway, so I propose that we agree on how the extension functions should be named.
Please have a look at the two functions, (i) the vocabulary-based Cleaner and (ii) the date Cleaner, available here.
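To make the naming discussion more concrete, here is a minimal sketch of what a vocabulary-based cleaning function and a date cleaning function might look like in plain Java. It is not the code from Cleaner.java or from this PR: the class name `CleaningFunctionsSketch`, the method names `cleanByVocabulary` and `cleanDate`, the toy language vocabulary, and the accepted date patterns are all assumptions made for illustration only.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these names come from the dnet-hadoop code base.
final class CleaningFunctionsSketch {

    // Toy vocabulary mapping lower-cased synonyms to a controlled term (illustrative data).
    private static final Map<String, String> LANGUAGES = Map.of(
            "english", "eng",
            "en", "eng",
            "italian", "ita",
            "it", "ita");

    // Input date patterns, tried in order (assumed for the example).
    private static final List<DateTimeFormatter> DATE_FORMATS = List.of(
            DateTimeFormatter.ISO_LOCAL_DATE,
            DateTimeFormatter.ofPattern("dd/MM/yyyy"),
            DateTimeFormatter.ofPattern("yyyy/MM/dd"));

    // Returns the controlled term for a known synonym, otherwise the trimmed input.
    static String cleanByVocabulary(String value) {
        if (value == null) {
            return "";
        }
        String trimmed = value.trim();
        return LANGUAGES.getOrDefault(trimmed.toLowerCase(Locale.ROOT), trimmed);
    }

    // Normalizes a date to ISO yyyy-MM-dd; returns an empty string if no pattern matches.
    static String cleanDate(String value) {
        if (value == null) {
            return "";
        }
        String trimmed = value.trim();
        for (DateTimeFormatter format : DATE_FORMATS) {
            try {
                return LocalDate.parse(trimmed, format).toString();
            } catch (DateTimeParseException e) {
                // Not this pattern: fall through to the next one.
            }
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(cleanByVocabulary(" English ")); // prints "eng"
        System.out.println(cleanDate("31/12/2020"));        // prints "2020-12-31"
    }
}
```

Both functions are static and take only the string to clean, which reflects the point above that no proxy instance needs to be created in the stylesheet and passed along with each call. How such functions would actually be registered with the XSLT engine is left out here.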
Furthermore, note that the unit test `eu.dnetlib.dhp.transformation.TransformationJobTest#testTransformSaxonHE` already showcases the transformation of a record from Zenodo with the TR provided by Sandro (zenodo_tr.xslt).

Side comment: why does the `TransformationFunctionProxy` class include Kafka-related, yet commented-out, lines of code? Kafka will probably be a good ally of ours in the future, but today I feel we should focus on shorter-term goals that do not involve deep architectural changes.

Lastly, the file `dc_cleaning_OPENAIREplus_compliant.jxslt` is added but never used in any test, so I'm going to ignore it until it is used.

Title changed from "WIP: additional XSLT transformation scripts, enhance methods" to "additional XSLT transformation scripts, enhance methods".

Andreas, could you please pull the most recent changes from the `hadoop_aggregator` branch of the upstream project so that the conflicts get resolved?

PR manually merged in fa7930d2e2.
Pull request closed