7096-fileGZip-collector-plugin #211

Merged
claudio.atzori merged 9 commits from 7096-fileGZip-collector-plugin into beta 2 years ago
schatz commented 2 years ago
Collaborator

Adding FileGZipCollectorPlugin, FileCollectorPlugin and their respective tests.
claudio.atzori was assigned by schatz 2 years ago
andreas.czerniak was assigned by schatz 2 years ago
schatz added 3 commits 2 years ago
claudio.atzori added 1 commit 2 years ago
claudio.atzori added 1 commit 2 years ago
claudio.atzori reviewed 2 years ago
@@ -0,0 +17,4 @@
log.info("baseUrl: {}", baseUrl);
try {
return new BufferedInputStream(new FileInputStream(baseUrl));
Owner

We are assuming that the files to be processed by these plugins are already stored on HDFS, right? I'm not sure that accessing them in this way would work. I think the access should be made through the `org.apache.hadoop.fs.FileSystem` object, similarly to how it is done in this other class: https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/aggregation/mdstore/MDStoreActionNode.java#L84
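
For context, a minimal sketch of the suggested access pattern (the class and method names below are illustrative, not the actual plugin code):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch, not the plugin's actual code: open the file through the
// Hadoop FileSystem API instead of a local FileInputStream.
public class HdfsFileAccessSketch {

	public static InputStream open(String baseUrl, Configuration conf) throws IOException {
		// Resolve the FileSystem from the URI of the path itself, so that both
		// hdfs:// and file:// baseUrl values are handled transparently.
		Path path = new Path(baseUrl);
		FileSystem fs = FileSystem.get(path.toUri(), conf);
		return new BufferedInputStream(fs.open(path));
	}
}
```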
schatz commented 2 years ago
Poster
Collaborator

Thank you for pointing this out, I have updated the code to use `org.apache.hadoop.fs.FileSystem` for that purpose.
claudio.atzori reviewed 2 years ago
@@ -0,0 +19,4 @@
log.info("baseUrl: {}", baseUrl);
try {
GZIPInputStream stream = new GZIPInputStream(new FileInputStream(baseUrl));
Owner

We are assuming that the files to be processed by these plugins are already stored on HDFS, right? I'm not sure that accessing them in this way would work. I think the access should be made through the `org.apache.hadoop.fs.FileSystem` object, similarly to how it is done in this other class: https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/aggregation/mdstore/MDStoreActionNode.java#L84
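
The same remark applies to the gzip variant; purely as an illustration (names are assumptions, not the plugin's actual code), the decompression would wrap the HDFS stream rather than a local `FileInputStream`:

```java
import java.io.IOException;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch only: the gzip variant layers the decompression on top
// of the stream opened through the Hadoop FileSystem API.
public class GzipHdfsAccessSketch {

	public static GZIPInputStream open(String baseUrl, Configuration conf) throws IOException {
		Path path = new Path(baseUrl);
		FileSystem fs = FileSystem.get(path.toUri(), conf);
		return new GZIPInputStream(fs.open(path));
	}
}
```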
schatz marked this conversation as resolved
claudio.atzori requested changes 2 years ago
claudio.atzori left a comment
Owner

The major remark I have for this PR is the means of access to the content of the files that the plugins are meant to collect in the aggregation task. In fact, we assume to use these plugins when we already have the files that we want to include in the aggregation pipeline, without worrying about reimplementing or scripting the procedures for storing them in their respective MetadataStores. Therefore the underlying assumption is that those files are already stored on HDFS in some path (the `baseUrl` parameter).

Please see the comments inline.
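
As an example of that assumption (the path is purely illustrative), such a collection workflow would be configured with something like `baseUrl=hdfs://nameservice1/user/dnet/collected/records.xml.gz`, pointing at content already copied onto the cluster.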
claudio.atzori added 1 commit 2 years ago
schatz added 1 commit 2 years ago
schatz requested review from claudio.atzori 2 years ago
schatz commented 2 years ago
Poster
Collaborator

I have made the required changes to use `org.apache.hadoop.fs.FileSystem` for accessing the input files from HDFS.
@claudio.atzori, can you check that it is OK now? Thanks!
claudio.atzori added 1 commit 2 years ago
claudio.atzori added 1 commit 2 years ago
claudio.atzori commented 2 years ago
Owner

Looks good now, thanks @schatz!
claudio.atzori merged commit c76ff6c613 into beta 2 years ago

Reviewers

claudio.atzori was requested for review 2 years ago
The pull request has been merged as c76ff6c613.
Command line instructions:

Step 1:

From your project repository, check out a new branch and test the changes.
git checkout -b 7096-fileGZip-collector-plugin beta
git pull origin 7096-fileGZip-collector-plugin

Step 2:

Merge the changes and update on Gitea.
git checkout beta
git merge --no-ff 7096-fileGZip-collector-plugin
git push origin beta
Reference: D-Net/dnet-hadoop#211