additional XSLT transformation scripts, enhance methods #97

Closed
andreas.czerniak wants to merge 49 commits from hadoop_aggregator into hadoop_aggregator
  • add transformation script: xslt_cleaning_datarepo_datacite.xsl (used 46x in prod)
  • change dhp-graph-mapper to use a longer path+filename
  • add java source and version properties

Before integrating the PR, I'd like to ask what motivated the introduction of yet another class implementing the vocabulary-based cleaning. In fact, the operations performed by TransformationFunctionProxy seem to be already available in Cleaner.java (https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/hadoop_aggregator/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/transformation/xslt/Cleaner.java).

If your intention is to reuse the xslt_cleaning_datarepo_datacite.xsl transformation rule as it is currently defined in the production system, I am afraid that won't be possible without a few adjustments. We should therefore probably prepare for a transition phase in which the transformation rules/scripts are duplicated and progressively migrated towards the updated definitions. We could agree to reuse as much as possible from the current definitions, perhaps maintaining the same extension function specification, but I don't see how much we would benefit in the end, as the updated XSLT engine assumes a different syntax for invoking the extension functions. In fact, the current transformation scripts are expected to declare

<xsl:variable name="tf" select="TransformationFunction:getInstance()" />

and then

<xsl:variable name="varEmbargoEndDate" select="TransformationFunction:convertString($tf, normalize-space(//*[local-name()='date'][@dateType='Available']), 'DateISO8601')" />

while in the new implementation you don't need to instantiate the tf variable and, most importantly, you cannot pass it to the convertString function.
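To make the difference concrete, here is a minimal Java sketch of a stateless date normalizer, i.e. a function that needs no instance to be created or passed around. The class name `DateIso8601Sketch`, the method `toIsoDate` and the list of accepted input patterns are all assumptions made up for illustration; this is not the code in the repository, and it says nothing about how the function is registered with the XSLT engine.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a stateless helper, so callers never hold or pass an instance.
final class DateIso8601Sketch {

    // Input patterns tried in order; this list is an assumption for illustration only.
    private static final List<DateTimeFormatter> INPUT_FORMATS = Arrays.asList(
        DateTimeFormatter.ISO_LOCAL_DATE,            // e.g. 2021-03-05
        DateTimeFormatter.ofPattern("yyyy/MM/dd"),   // e.g. 2021/03/05
        DateTimeFormatter.ofPattern("dd-MM-yyyy"));  // e.g. 05-03-2021

    private DateIso8601Sketch() {
        // no instances: the function depends only on its argument
    }

    // Returns the date as yyyy-MM-dd, or an empty string when no pattern matches.
    static String toIsoDate(String raw) {
        if (raw == null) {
            return "";
        }
        String value = raw.trim();
        for (DateTimeFormatter format : INPUT_FORMATS) {
            try {
                return LocalDate.parse(value, format).format(DateTimeFormatter.ISO_LOCAL_DATE);
            } catch (DateTimeParseException e) {
                // not this pattern, try the next one
            }
        }
        return "";
    }
}
```

Since the function depends only on its argument, there is nothing equivalent to `$tf` for a caller to create or hand over.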

That said, the transformation scripts will need to be revised anyway, so I propose that we agree on how the extension functions should be named.

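Please have a look at the two functions, (i) the vocabulary-based Cleaner and (ii) the date Cleaner, available here: https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/hadoop_aggregator/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/transformation/xsl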
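For readers unfamiliar with the idea, vocabulary-based cleaning can be pictured roughly as a synonym lookup: a raw value is mapped to the canonical term of a given vocabulary, and left untouched when no synonym is known. The sketch below only illustrates that idea under these assumptions; the class `VocabularyCleanerSketch`, its methods and the vocabulary id `example:languages` are hypothetical and do not reflect the actual Cleaner.java.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of vocabulary-based cleaning as a synonym lookup.
final class VocabularyCleanerSketch {

    // vocabulary id -> (lower-cased synonym -> canonical term)
    private final Map<String, Map<String, String>> vocabularies = new HashMap<>();

    // Registers a synonym for a canonical term in the given vocabulary.
    void addSynonym(String vocabularyId, String synonym, String term) {
        vocabularies
            .computeIfAbsent(vocabularyId, id -> new HashMap<>())
            .put(synonym.trim().toLowerCase(Locale.ROOT), term);
    }

    // Returns the canonical term, or the trimmed input when no synonym is known.
    String clean(String value, String vocabularyId) {
        if (value == null) {
            return "";
        }
        String trimmed = value.trim();
        Map<String, String> synonyms =
            vocabularies.getOrDefault(vocabularyId, Collections.<String, String>emptyMap());
        return synonyms.getOrDefault(trimmed.toLowerCase(Locale.ROOT), trimmed);
    }
}
```

Falling back to the input for unknown values is a design choice of this sketch; a real cleaner might instead flag or drop unrecognized terms.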

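Furthermore, note that the unit test eu.dnetlib.dhp.transformation.TransformationJobTest#testTransformSaxonHE already showcases the transformation of a record from Zenodo with the TR provided by Sandro (zenodo_tr.xslt, https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/hadoop_aggregator/dhp-workflows/dhp-aggregation/src/test/resources/eu/dnetlib/dhp/transform/zenodo_tr.xslt).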

Side comment: why does the TransformationFunctionProxy class include Kafka-related code lines, albeit commented out? Kafka will probably be a good ally of ours in the future, but today I feel we should focus on shorter-term goals that do not involve deep architectural changes.

Lastly, the file dc_cleaning_OPENAIREplus_compliant.jxslt is added but never used in any test, so I'm going to ignore it until it is used.

claudio.atzori changed title from WIP: additional XSLT transformation scripts, enhance methods to additional XSLT transformation scripts, enhance methods 2021-03-05 15:11:22 +01:00

Andreas, could you please pull the most recent changes from the hadoop_aggregator branch on the upstream project so that the conflicts get resolved?

PR manually merged in fa7930d2e2c4aeda1ee42018be065826367dc96e
claudio.atzori closed this pull request 2021-03-05 15:47:01 +01:00

Reference: D-Net/dnet-hadoop#97