7096-fileGZip-collector-plugin #211
Reference: D-Net/dnet-hadoop#211
No description provided.
Adding FileGZipCollectorPlugin, FileCollectorPlugin and their respective tests.
@ -0,0 +17,4 @@
log.info("baseUrl: {}", baseUrl);
try {
return new BufferedInputStream(new FileInputStream(baseUrl));
We are assuming that the files to be processed by these plugins are already stored on HDFS, right? I'm not sure that accessing them this way would work. I think the access should be made through the org.apache.hadoop.fs.FileSystem object, similarly to how it is done in this other class: https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/aggregation/mdstore/MDStoreActionNode.java#L84

Thank you for pointing this out, I have updated the code to use org.apache.hadoop.fs.FileSystem for that purpose.

@ -0,0 +19,4 @@
log.info("baseUrl: {}", baseUrl);
try {
GZIPInputStream stream = new GZIPInputStream(new FileInputStream(baseUrl));
We are assuming that the files to be processed by these plugins are already stored on HDFS, right? I'm not sure that accessing them this way would work. I think the access should be made through the org.apache.hadoop.fs.FileSystem object, similarly to how it is done in this other class: https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/aggregation/mdstore/MDStoreActionNode.java#L84
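As a side note on the decompression step itself: the wrapping of an input stream in a GZIPInputStream can be exercised in isolation with a plain-JDK round trip, independently of how the underlying stream is obtained. This is a minimal sketch for illustration only; the class and method names below are hypothetical and are not part of the PR.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipStreamDemo {

    // Wrap any InputStream in a GZIPInputStream and read it fully,
    // mirroring the decompression step the plugin performs.
    static String readGzipped(InputStream in) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new BufferedInputStream(in));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gzip.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString(StandardCharsets.UTF_8.name());
        }
    }

    // Compress a string in memory, then decompress it again,
    // so the wrapper can be tested without touching any file system.
    static String roundTrip(String text) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return readGzipped(new ByteArrayInputStream(compressed.toByteArray()));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("<record>hello</record>"));
    }
}
```

The same readGzipped helper works regardless of whether the wrapped stream comes from a local file or from HDFS, which is why the review below focuses only on how the stream is opened.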
The major remark I have for this PR is the means of access to the content of the files that the plugins are meant to collect in the aggregation task. In fact, we assume these plugins are used once we have the files we want to include in the aggregation pipeline, without worrying about reimplementing or scripting the procedures for storing them in their respective MetadataStores. Therefore the underlying assumption is that those files are already stored on HDFS at some path (the baseUrl parameter).
Please see the comments inline.
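The suggested change can be sketched as follows. This is an illustration only, not the PR's actual code: it assumes hadoop-common on the classpath, and the class, method, and parameter names are hypothetical. Opening the file through org.apache.hadoop.fs.FileSystem makes the plugin work for hdfs:// paths (and still for file:// URIs) instead of being limited to the local file system.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileAccessSketch {

    // Resolve the FileSystem from the URI scheme (hdfs://, file://, ...)
    // and open the path through it, rather than via java.io.FileInputStream.
    static InputStream open(String baseUrl, boolean gzipped) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(baseUrl), new Configuration());
        InputStream in = fs.open(new Path(baseUrl));
        // For the GZip variant of the plugin, wrap the stream before returning it.
        return gzipped ? new GZIPInputStream(in) : in;
    }
}
```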
I have made the required changes to use org.apache.hadoop.fs.FileSystem for accessing the input files from HDFS.

@claudio.atzori, can you please check that it is ok now? Thanks
Looks good now, thanks @schatz !