7096-fileGZip-collector-plugin #211

Merged
claudio.atzori merged 9 commits from 7096-fileGZip-collector-plugin into beta 2022-06-16 15:34:45 +02:00
Member

Adding FileGZipCollectorPlugin, FileCollectorPlugin and their respective tests.
claudio.atzori was assigned by schatz 2022-04-07 14:18:09 +02:00
andreas.czerniak was assigned by schatz 2022-04-07 14:18:09 +02:00
schatz added 3 commits 2022-04-07 14:18:10 +02:00
claudio.atzori added 1 commit 2022-04-21 11:42:48 +02:00
claudio.atzori added 1 commit 2022-04-22 11:22:26 +02:00
claudio.atzori reviewed 2022-04-22 11:30:49 +02:00
@@ -0,0 +17,4 @@
log.info("baseUrl: {}", baseUrl);
try {
return new BufferedInputStream(new FileInputStream(baseUrl));

We are assuming that the files to be processed by these plugins are already stored on HDFS, right? I'm not sure that accessing them this way would work. I think the access should be made through the `org.apache.hadoop.fs.FileSystem` object, similarly to how it is done in this other class: https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/aggregation/mdstore/MDStoreActionNode.java#L84
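For reference, a minimal sketch of what the suggested `FileSystem`-based access could look like (the class name, the `nameNode` parameter, and the placeholder values are illustrative assumptions, not code from this PR):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: open a file stored on HDFS through the Hadoop
// FileSystem API instead of java.io.FileInputStream, which can only read
// from the local filesystem of the node running the code.
public class HdfsFileAccessSketch {

	public static InputStream open(String nameNode, String baseUrl) throws IOException {
		Configuration conf = new Configuration();
		// e.g. "hdfs://namenode:8020" -- placeholder value, assumed, not from the PR
		conf.set("fs.defaultFS", nameNode);
		FileSystem fs = FileSystem.get(conf);
		// fs.open returns an FSDataInputStream, which is an InputStream
		return new BufferedInputStream(fs.open(new Path(baseUrl)));
	}
}
```

This requires a `hadoop-client` (or `hadoop-common`) dependency on the classpath; `dnet-hadoop` modules already carry it.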
Author
Member

Thank you for pointing this out, I have updated the code to use `org.apache.hadoop.fs.FileSystem` for that purpose.
claudio.atzori reviewed 2022-04-22 11:31:12 +02:00
@@ -0,0 +19,4 @@
log.info("baseUrl: {}", baseUrl);
try {
GZIPInputStream stream = new GZIPInputStream(new FileInputStream(baseUrl));

We are assuming that the files to be processed by these plugins are already stored on HDFS, right? I'm not sure that accessing them this way would work. I think the access should be made through the `org.apache.hadoop.fs.FileSystem` object, similarly to how it is done in this other class: https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/aggregation/mdstore/MDStoreActionNode.java#L84
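For the gzipped variant, the same point applies: the `GZIPInputStream` can simply wrap whatever `InputStream` the Hadoop `FileSystem` provides (i.e. `fs.open(new Path(baseUrl))`) instead of a `FileInputStream`. The sketch below demonstrates the wrapping with only `java.util.zip`, using an in-memory buffer to stand in for the HDFS file; the class name and payload are illustrative:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipStreamSketch {

	// Wrap any InputStream (in the plugin, this would be fs.open(new Path(baseUrl)))
	// in a GZIPInputStream to read the compressed content transparently.
	public static String readGzipped(InputStream compressed) throws IOException {
		try (BufferedReader reader = new BufferedReader(
			new InputStreamReader(new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
			StringBuilder sb = new StringBuilder();
			int c;
			while ((c = reader.read()) != -1) {
				sb.append((char) c);
			}
			return sb.toString();
		}
	}

	public static void main(String[] args) throws IOException {
		// Build a small gzipped payload in memory to stand in for an HDFS file.
		ByteArrayOutputStream buf = new ByteArrayOutputStream();
		try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
			gz.write("<record>test</record>".getBytes(StandardCharsets.UTF_8));
		}
		System.out.println(readGzipped(new ByteArrayInputStream(buf.toByteArray())));
	}
}
```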
schatz marked this conversation as resolved
claudio.atzori requested changes 2022-04-22 11:36:31 +02:00
claudio.atzori left a comment
Owner

The major remark I have for this PR is the means of access to the content of the files that the plugins are meant to collect in the aggregation task. We assume these plugins are used when we already have the files we want to include in the aggregation pipeline, without worrying about reimplementing or scripting the procedures for storing them in their respective MetadataStores. Therefore the underlying assumption is that those files are already stored on HDFS at some path (the `baseUrl` parameter).

Please see the comments inline.
claudio.atzori added 1 commit 2022-04-26 09:02:17 +02:00
schatz added 1 commit 2022-04-28 15:31:24 +02:00
schatz requested review from claudio.atzori 2022-04-28 15:33:35 +02:00
Author
Member

I have made the required changes to use `org.apache.hadoop.fs.FileSystem` for accessing the input files from HDFS.
@claudio.atzori, can you please check that it is OK now? Thanks!
claudio.atzori added 1 commit 2022-06-16 09:22:18 +02:00
claudio.atzori added 1 commit 2022-06-16 09:28:52 +02:00

Looks good now, thanks @schatz!
claudio.atzori merged commit c76ff6c613 into beta 2022-06-16 15:34:45 +02:00