7096-fileGZip-collector-plugin #211
Reference: D-Net/dnet-hadoop#211
No description provided.
Adding FileGZipCollectorPlugin, FileCollectorPlugin and their respective tests.
@ -0,0 +17,4 @@
log.info("baseUrl: {}", baseUrl);
try {
return new BufferedInputStream(new FileInputStream(baseUrl));
We are assuming that the files to be processed by these plugins are already stored on HDFS, right? I'm not sure that accessing them this way would work. I think the access should be made through the org.apache.hadoop.fs.FileSystem object, similarly to how it is done in this other class: https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/aggregation/mdstore/MDStoreActionNode.java#L84

Thank you for pointing this out, I have updated the code to use org.apache.hadoop.fs.FileSystem for that purpose.

@ -0,0 +19,4 @@
log.info("baseUrl: {}", baseUrl);
try {
GZIPInputStream stream = new GZIPInputStream(new FileInputStream(baseUrl));
We are assuming that the files to be processed by these plugins are already stored on HDFS, right? I'm not sure that accessing them this way would work. I think the access should be made through the org.apache.hadoop.fs.FileSystem object, similarly to how it is done in this other class: https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/aggregation/mdstore/MDStoreActionNode.java#L84
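As a side note on the decompression step itself: the wrapping of an input stream in a GZIPInputStream can be exercised in isolation with a plain-JDK round trip, independently of how the underlying stream is obtained. This is a minimal sketch for illustration only; the class and method names below are hypothetical and are not part of the PR.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipStreamDemo {

    // Wrap any InputStream in a GZIPInputStream and read it fully,
    // mirroring the decompression step the plugin performs.
    static String readGzipped(InputStream in) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new BufferedInputStream(in));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gzip.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString(StandardCharsets.UTF_8.name());
        }
    }

    // Compress a string in memory, then decompress it again,
    // so the wrapper can be tested without touching any file system.
    static String roundTrip(String text) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return readGzipped(new ByteArrayInputStream(compressed.toByteArray()));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("<record>hello</record>"));
    }
}
```

The same readGzipped helper works regardless of whether the wrapped stream comes from a local file or from HDFS, which is why the review below focuses only on how the stream is opened.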
The major remark I have for this PR is the means of access to the content of the files that the plugins are meant to collect in the aggregation task. In fact, we assume these plugins are used once we have the files we want to include in the aggregation pipeline, without worrying about reimplementing or scripting the procedures for storing them in their respective MetadataStores. Therefore the underlying assumption is that those files are already stored on HDFS at some path (the baseUrl parameter).
Please see the comments inline.
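The suggested change can be sketched as follows. This is an illustration only, not the PR's actual code: it assumes hadoop-common on the classpath, and the class, method, and parameter names are hypothetical. Opening the file through org.apache.hadoop.fs.FileSystem makes the plugin work for hdfs:// paths (and still for file:// URIs) instead of being limited to the local file system.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileAccessSketch {

    // Resolve the FileSystem from the URI scheme (hdfs://, file://, ...)
    // and open the path through it, rather than via java.io.FileInputStream.
    static InputStream open(String baseUrl, boolean gzipped) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(baseUrl), new Configuration());
        InputStream in = fs.open(new Path(baseUrl));
        // For the GZip variant of the plugin, wrap the stream before returning it.
        return gzipped ? new GZIPInputStream(in) : in;
    }
}
```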
I have made the required changes to use org.apache.hadoop.fs.FileSystem for accessing the input files from HDFS.

@claudio.atzori, can you please check that it is ok now? Thanks
Looks good now, thanks @schatz !