PDF-coverage-evaluator
This Python 3 script checks each PID of a given collection against the pdf-aggregation-service-DB,
to determine whether that PID exists in the aggregator's DB and has full-text coverage.
In detail, it does the following:
- extracts the PIDs (DOIs and PMIDs) from a JSON file.
- if a "previous-results" file is provided, extracts the PIDs from there as well and reduces the original input to the PIDs which have not been processed before.
- splits the remaining PIDs into batches and, for each batch, submits one evaluation job per PID to a "ThreadPoolExecutor" which uses 12 threads.
- for each one of the PID-pairs, runs a query with Impala, to quickly retrieve the following fields: "dedupid", "id", "pid", "pid_type", "fulltext_url", "location".
- saves the results in a JSON file, including the PID for which the check was made. For example, if a record has both a "doi" and a "pmid", and a full-text was detected for (at least) the "doi", then the output record has the "doi" as its PID.
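The filtering, batching and threading steps above can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: the function names, the batch size and the placeholder `fake_evaluate_pid` (which stands in for the Impala lookup) are all assumptions; only the thread count of 12 comes from this README.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 12  # thread count stated in this README


def remove_processed(pids, previous_pids):
    """Return the pids (in input order) that are not in the previous results."""
    processed = set(previous_pids)
    return [pid for pid in pids if pid not in processed]


def split_in_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def evaluate_batches(pids, evaluate_pid, batch_size=1000):
    """Run evaluate_pid for every pid, one batch at a time, on a thread pool."""
    results = []
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
        for batch in split_in_batches(pids, batch_size):
            # executor.map returns the results in the same order as the batch.
            results.extend(executor.map(evaluate_pid, batch))
    return results


def fake_evaluate_pid(pid):
    """Hypothetical stand-in for the real per-pid Impala query."""
    return {"pid": pid, "fulltext_url": None}


if __name__ == "__main__":
    todo = remove_processed(["10.1000/a", "12345", "10.1000/b"], ["12345"])
    for record in evaluate_batches(todo, fake_evaluate_pid, batch_size=2):
        print(record)
```

Processing one batch at a time keeps the number of pending jobs bounded, which is presumably why the input is split before being handed to the executor.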
Install & Run:
    python3 --version
    sudo apt install -y python3 python3-pip
    sudo pip3 install --upgrade pip
    cd pdfCoverageEvaluator
    sudo pip3 install -r requirements.txt
    python3 pdfCoverageEvaluator.py ${input_file_path} ${max_num_to_process} ${previous_results_file_path}
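To make the three command-line arguments concrete, here is one way they could be parsed. This is only a sketch under assumptions: the README does not say how the script reads its arguments, so the use of `argparse`, the integer type of `max_num_to_process` and the optional status of `previous_results_file_path` (suggested by "if a previous-results file is provided") are guesses.

```python
import argparse


def parse_args(argv):
    """Parse the three positional arguments shown in the run command."""
    parser = argparse.ArgumentParser(prog="pdfCoverageEvaluator.py")
    parser.add_argument("input_file_path",
                        help="JSON file with the pids to check")
    parser.add_argument("max_num_to_process", type=int,
                        help="upper limit on the number of pids to process (assumed integer)")
    parser.add_argument("previous_results_file_path", nargs="?", default=None,
                        help="optional results file from an earlier run")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["input.json", "1000"])
    print(args.input_file_path, args.max_num_to_process, args.previous_results_file_path)
```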
Install & run, using the provided scripts
- transferToRemoteMachineAndExecute.sh: this script transfers the project to the defined location on a remote machine, replaces the project-files there and executes the software inside a "screen" session, so that the execution is not lost if the session is closed before the software finishes.
- transferToRemoteMachine.sh: this script just transfers the project to the defined location on a remote machine and replaces the project-files there.
Checking the logs
The log file is located inside the project's directory and is named "pdfCoverageEvaluator.log".
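For a quick look at the most recent log entries without opening the whole file, something like the following could be used. It is a small illustrative helper, not part of the project; the function name `tail` and the default of 20 lines are assumptions, and it expects to be run from the project's directory.

```python
from collections import deque
from pathlib import Path

LOG_FILE = Path("pdfCoverageEvaluator.log")  # file name taken from this README


def tail(path, num_lines=20):
    """Return the last num_lines lines of a text file, without newlines."""
    with open(path, encoding="utf-8", errors="replace") as log:
        # A bounded deque keeps only the most recent lines while reading.
        return [line.rstrip("\n") for line in deque(log, maxlen=num_lines)]


if __name__ == "__main__":
    if LOG_FILE.exists():
        for line in tail(LOG_FILE):
            print(line)
    else:
        print(f"No log file found at {LOG_FILE.resolve()}")
```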