Issues in the XSLT transformation workflow #252
Reference: D-Net/dnet-hadoop#252
I noticed a couple of issues in the transformation workflow implementation:

- The `XsltTransformer` is created on every invocation of the `.call` method. Although the execution happens in parallel on the cluster, the `XsltTransformer` object used in f3f7604e6c/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/transformation/xslt/XSLTTransformationFunction.java (L68) should be instantiated once in the constructor and reused for each invocation of the `.call` method.
- When the XSLT is broken and the `XsltTransformer` cannot be created, the errors are thrown N times, with N equal to the number of input records. Note that such errors are never actually propagated to the caller, because the block is wrapped in a try/catch that returns null in case of error. I believe this is wrong: such errors should happen only once, at instantiation time, while errors occurring during the application of the transformation should break the process.
- The `XsltTransformer` needs inputs from the input `MetadataRecord` (e.g. setting the Datasource ID/NAME parameters). However, these should not come from each record. Instead, the `TransformSparkJobNode` should be aware of such details, just as it happens in the metadata collection workflow with the parameter `dataSourceInfo`.

It turned out the creation of the
`XsltTransformer` must be performed within the `.call` method implementation. In fact, defining it as a field of the class `XSLTTransformationFunction` makes Spark complain because that object is not serializable, just like the other classes from the Saxon API.

Therefore, the only margin of improvement I can see in the current implementation is to make the creation of the `XsltTransformer` interrupt the process in case of issues in the XSLT, while letting its application return null in case of errors, i.e. define two separate try/catch blocks to isolate the two phases.
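
To make the proposal concrete, here is a minimal sketch of the two separate try/catch blocks. It is not the actual `XSLTTransformationFunction` code. To stay self-contained it uses the JDK's built-in `javax.xml.transform` API as a stand-in for the Saxon classes, and the class name `TwoPhaseTransformSketch` and the string-based `call` signature are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical stand-in for the transformation function; the real code uses Saxon.
class TwoPhaseTransformSketch {

    // Assumed shape of the per-record call: stylesheet text in, record XML in, result out.
    static String call(String xslt, String recordXml) {
        final Transformer transformer;

        // Phase 1: build the transformer. A broken XSLT is a configuration
        // problem, so it is rethrown and interrupts the whole process.
        try {
            transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        } catch (TransformerException e) {
            throw new IllegalStateException("Invalid XSLT, aborting transformation", e);
        }

        // Phase 2: apply the transformer. A failure here concerns a single
        // record, so null is returned, as in the current implementation.
        try {
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(recordXml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            return null;
        }
    }
}
```

The transformer is still created inside `call`, so nothing non-serializable is stored as a field. The only change is that a failure in the first phase is no longer swallowed together with failures in the second.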