Issues in the XSLT transformation workflow #252

claudio.atzori · 2022-09-28T12:00:08+02:00

claudio.atzori commented

2022-09-28 12:00:08 +02:00

I noticed a few issues in the transformation workflow implementation:

1. The XSLT transformation is currently instantiated on every invocation of the `.call` method. Although the execution happens in parallel on the cluster, the `XsltTransformer` object used in https://code-repo.d4science.org/D-Net/dnet-hadoop/src/commit/f3f7604e6c342a4ec931ec5d1e5fb7370da029c3/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/transformation/xslt/XSLTTransformationFunction.java#L68 should be instantiated once in the constructor and reused for each invocation of the `.call` method.
2. As a consequence of the point above, if the XSLT contains syntactic errors that prevent creating the `XsltTransformer`, the errors are thrown N times, where N is the number of input records. Note that such errors are never actually propagated to the caller, because the code is wrapped in a try/catch block that returns null in case of error. I believe this is wrong: such errors should happen only once, at instantiation time, while errors occurring during the application of the transformation should break the process.
3. It might seem that creating the `XsltTransformer` requires inputs from the input `MetadataRecord` (e.g. setting the Datasource ID/NAME parameters). However, these should not come from each record; instead, the `TransformSparkJobNode` should be aware of such details, just as the metadata collection workflow is through the `dataSourceInfo` parameter.

claudio.atzori added the bug label 2022-09-28 12:00:08 +02:00

sandro.labruzzo was assigned by claudio.atzori 2022-09-28 12:00:10 +02:00

claudio.atzori commented

2022-09-28 13:21:45 +02:00

It turned out the creation of the `XsltTransformer` must be performed within the `.call` method implementation. In fact, defining it as a field of the class `XSLTTransformationFunction` makes Spark complain that the object is not serializable, just like the other classes from the Saxon API.

Therefore, the only margin for improvement I can see in the current implementation is to make the creation of the `XsltTransformer` interrupt the process when there are issues in the XSLT, while letting its application return null in case of errors, i.e. define two separate try/catch blocks to isolate the two phases.
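
A minimal sketch of the two separate try/catch blocks could look like the following. It is not the actual `XSLTTransformationFunction` code: to stay self-contained it uses the JDK's built-in JAXP `Transformer` as a stand-in for Saxon's `XsltTransformer`, and the class name `TwoPhaseXsltSketch`, the `transform` method and the choice of `IllegalStateException` are assumptions made only for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical stand-in for the body of the .call method; JAXP replaces Saxon here.
public class TwoPhaseXsltSketch {

    public static String transform(String xslt, String recordXml) {
        final Transformer transformer;

        // Phase 1: create the transformer. A broken stylesheet affects every
        // record, so the error is rethrown and interrupts the process.
        try {
            transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException("Unable to create the transformer from the XSLT", e);
        }

        // Phase 2: apply the transformation. A failure here concerns a single
        // record, so null is returned, as in the current implementation.
        try {
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(recordXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            return null;
        }
    }
}
```

With this split, a syntactically invalid XSLT fails on the first record instead of silently producing a null for each of the N records, while a single malformed record still yields null without stopping the job.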

claudio.atzori referenced this issue 2022-09-28 14:18:50 +02:00: [metadata transformation] better error handling in XSLT transformation #253