Issues in the XSLT transformation workflow #252

Open
opened 2022-09-28 12:00:08 +02:00 by claudio.atzori · 1 comment

I noticed a couple of issues in the transformation workflow implementation:

  1. It currently instantiates the XSLT transformation for each invocation of the `.call` method. Although the execution happens in parallel on the cluster, the `XsltTransformer` object used in https://code-repo.d4science.org/D-Net/dnet-hadoop/src/commit/f3f7604e6c342a4ec931ec5d1e5fb7370da029c3/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/transformation/xslt/XSLTTransformationFunction.java#L68 should be instantiated once in the constructor and reused for each invocation of the `.call` method.
  2. As a consequence of the approach described in the point above, if the XSLT contains syntactic errors that prevent creating the `XsltTransformer`, the error is thrown N times, with N equal to the number of input records. Note that such errors are never actually propagated to the caller, because the block is wrapped in a try/catch that returns null in case of error.
    I believe this is wrong: such errors should occur only once, at instantiation time, while errors occurring during the application of the transformation should break the process.
  3. It might seem that the preparations for creating the `XsltTransformer` need inputs from each input `MetadataRecord` (e.g. setting the Datasource ID/NAME parameters); however, these should not come from each record. Instead, the `TransformSparkJobNode` should be aware of such details, just like the metadata collection workflow does with the parameter `dataSourceInfo`.
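The failure mode in points 1 and 2 can be sketched outside of Spark. This is a minimal, hedged illustration only: the real code uses the Saxon s9api inside a Spark `MapFunction`, while here the JDK's built-in `javax.xml.transform` API and a plain list of records stand in for both. The names `callPerRecord` and `BROKEN_XSLT` are illustrative, not from the codebase.

```java
import javax.xml.transform.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;

public class PerRecordInstantiation {

    static final String BROKEN_XSLT = "<xsl:stylesheet"; // syntactically invalid stylesheet

    // Mirrors the shape of the .call method: the stylesheet is compiled anew for
    // every record, and a single try/catch swallows BOTH compilation and
    // transformation errors, mapping the record to null.
    static String callPerRecord(String record) {
        try {
            Transformer tr = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(BROKEN_XSLT)));
            StringWriter out = new StringWriter();
            tr.transform(new StreamSource(new StringReader(record)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            return null; // the compile error is hit once per record and never propagated
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("<r>1</r>", "<r>2</r>", "<r>3</r>");
        // The broken XSLT fails N times (once per record), every record maps to
        // null, and the caller never learns that the stylesheet is invalid.
        long nulls = records.stream()
                .map(PerRecordInstantiation::callPerRecord)
                .filter(r -> r == null)
                .count();
        System.out.println(nulls); // prints 3
    }
}
```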
claudio.atzori added the bug label 2022-09-28 12:00:08 +02:00
sandro.labruzzo was assigned by claudio.atzori 2022-09-28 12:00:10 +02:00
claudio.atzori (Author, Owner) commented:

It turned out the creation of the `XsltTransformer` must be performed within the `.call` method implementation. In fact, defining it as a field of the class `XSLTTransformationFunction` makes Spark complain that the object is not serializable, just like the other classes from the Saxon API.

Therefore, the only margin of improvement I can see in the current implementation is to make the creation of the `XsltTransformer` interrupt the process in case of issues in the XSLT, while letting its application return null in case of errors, i.e. to define two separate try/catch blocks to isolate the two phases.
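The two-phase error handling described above can be sketched as follows. Again this is a hedged illustration: the project uses the Saxon s9api, but the JDK's `javax.xml.transform` API exposes the same split between a compile-time `TransformerConfigurationException` and a per-record `TransformerException`. The names `compile` and `transformRecord` are illustrative only.

```java
import javax.xml.transform.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class TwoPhaseTransform {

    // Phase 1: compiling the stylesheet. A syntactically broken XSLT should
    // fail the whole job exactly once, so the exception is propagated.
    static Templates compile(String xslt) throws TransformerConfigurationException {
        return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));
    }

    // Phase 2: applying the compiled stylesheet to one record. A failure here
    // concerns a single input record, so it is caught and mapped to null,
    // mirroring the per-record behaviour described in the issue.
    static String transformRecord(Templates templates, String record) {
        try {
            StringWriter out = new StringWriter();
            templates.newTransformer().transform(
                    new StreamSource(new StringReader(record)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            return null; // per-record failure: skip the record, keep the job alive
        }
    }

    public static void main(String[] args) throws Exception {
        String identity = "<xsl:stylesheet version='1.0' "
                + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:template match='/'><xsl:copy-of select='.'/></xsl:template>"
                + "</xsl:stylesheet>";
        Templates t = compile(identity); // would throw here, once, if the XSLT were broken
        System.out.println(transformRecord(t, "<record><id>1</id></record>") != null);
        // A malformed input record yields null instead of aborting the job:
        System.out.println(transformRecord(t, "<record>") == null);
    }
}
```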


Reference: D-Net/dnet-hadoop#252