Lampros Smyrnaios e6d6382bd0 Add the pdfCoverageEvaluator software. 2024-02-19 13:08:26 +02:00
LICENSE Add the pdfCoverageEvaluator software. 2024-02-19 13:08:26 +02:00
README.md Add the pdfCoverageEvaluator software. 2024-02-19 13:08:26 +02:00
pdfCoverageEvaluator.py Add the pdfCoverageEvaluator software. 2024-02-19 13:08:26 +02:00
requirements.txt Add the pdfCoverageEvaluator software. 2024-02-19 13:08:26 +02:00
transferToRemoteMachine.sh Add the pdfCoverageEvaluator software. 2024-02-19 13:08:26 +02:00
transferToRemoteMachineAndExecute.sh Add the pdfCoverageEvaluator software. 2024-02-19 13:08:26 +02:00

README.md

PDF-coverage-evaluator

This Python 3 script checks each PID in a given collection against the pdf-aggregation-service DB, to find out whether that PID exists in the aggregator's DB and has full-text coverage.

In detail, it does the following:

  • extracts the pids from a json-file (DOIs and PMIDs)
  • if a "previous-results" file is provided, extracts the pids from there as well and reduces the original input to the pids which have not been processed before.
  • splits them into batches and, for each batch, submits each pid-evaluation job to a "ThreadPoolExecutor", which uses 12 threads.
  • for each one of the PID-pairs, makes a query with Impala, to quickly acquire the following: "dedupid", "id", "pid", "pid_type", "fulltext_url", "location"
  • saves the results in a json-file, including the pid for which it made the check (for example in case a record has both "doi" and "pmid" and a fulltext was detected for the "doi" (at least), then the output-record has the "doi" as its PID)
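The filtering and batching steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names, the batch size and the shape of the result records are assumptions, and `evaluate_pid` is a hypothetical stand-in for the Impala query, which is not reproduced here. Only the use of a "ThreadPoolExecutor" with 12 threads comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 12  # thread count stated in the README


def remaining_pids(input_pids, previously_processed):
    """Drop pids found in the previous results, preserving input order."""
    seen = set(previously_processed)
    return [pid for pid in input_pids if pid not in seen]


def split_in_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def evaluate_pid(pid):
    """Hypothetical stand-in for the per-pid query against the aggregator's DB."""
    # Assumption: the real script fills these fields from the query result.
    return {"pid": pid, "fulltext_url": None}


def evaluate_all(pids, batch_size=1000):
    """Evaluate every pid, one batch at a time, on a shared thread pool."""
    results = []
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
        for batch in split_in_batches(pids, batch_size):
            # executor.map returns results in the order the pids were given
            results.extend(executor.map(evaluate_pid, batch))
    return results
```

Processing in batches keeps the number of pending jobs bounded, so results can be written out after each batch instead of only at the end.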

Install & Run:

python3 --version
sudo apt install -y python3 python3-pip
sudo pip3 install --upgrade pip
cd pdfCoverageEvaluator
sudo pip3 install -r requirements.txt
python3 pdfCoverageEvaluator.py ${input_file_path} ${max_num_to_process} ${previous_results_file_path}
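As an illustration of the three positional arguments in the command above, the sketch below shows one way they could be read. It is hypothetical and not taken from pdfCoverageEvaluator.py: the function name `parse_args`, the integer conversion of the second argument and the usage message are assumptions. The third argument is treated as optional because the "previous-results" file is described as something that may or may not be provided.

```python
def parse_args(argv):
    """Read the positional arguments in the order used by the run command."""
    if len(argv) < 3:
        raise SystemExit(
            "usage: pdfCoverageEvaluator.py <input_file_path> "
            "<max_num_to_process> [previous_results_file_path]"
        )
    input_file_path = argv[1]
    # Assumption: the maximum number of pids to process is a plain integer.
    max_num_to_process = int(argv[2])
    # Assumption: omitting the third argument means "no previous results".
    previous_results_file_path = argv[3] if len(argv) > 3 else None
    return input_file_path, max_num_to_process, previous_results_file_path
```

In a script this would typically be called as `parse_args(sys.argv)`.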

Install & run, using the provided scripts

  1. transferToRemoteMachineAndExecute.sh: this script transfers the project to the defined location on a remote machine, replaces the project-files there and executes the software inside a screen session, so that the execution is not lost if the session is closed before the software finishes.
  2. transferToRemoteMachine.sh: this script just transfers the project to the defined location on a remote machine and replaces the project-files there.

Checking the logs

The log file is located inside the project's directory and has this name: "pdfCoverageEvaluator.log"
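For a quick look at the most recent log entries from Python, a small helper along these lines could be used. It is a sketch and not part of the project: the `tail_log` function and its parameters are hypothetical, and only the log file name is taken from the line above.

```python
from collections import deque
from pathlib import Path

LOG_FILE_NAME = "pdfCoverageEvaluator.log"  # name given in the README


def tail_log(project_dir, num_lines=20):
    """Return the last num_lines lines of the log inside project_dir."""
    log_path = Path(project_dir) / LOG_FILE_NAME
    with log_path.open(encoding="utf-8", errors="replace") as log_file:
        # A bounded deque keeps only the newest lines while reading the file once.
        return [line.rstrip("\n") for line in deque(log_file, maxlen=num_lines)]
```

For example, `tail_log(".", 50)` would return the last 50 lines when run from the project's directory.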