The Controller app of the PDF Aggregation Service.

UrlsController

The Controller application receives requests from the Workers (deployed on the cloud), constructs an assignments-list with data retrieved from a database and returns the list to the Workers.
It then receives the "WorkerReports", requests the full-texts from the Workers in batches, and uploads them to the S3 Object Store. Finally, it writes the related reports, along with the updated file-locations, into the database.

It can also process Bulk-Import requests, from compatible data sources, in which case it receives the full-text files immediately, without offloading crawling jobs to Workers.

For managing and generating data, we use Impala JDBC and WebHDFS.

To install and run the application:

Clone the repository with git clone and then run cd UrlsController.
Set the preferred values inside the application.yml file. Specifically, for tests, set the "services.pdfaggregation.controller.isTestEnvironment" property to "true" and make sure the "services.pdfaggregation.controller.db.testDatabaseName" property is set to a test-database.
Execute the installAndRun.sh script which builds and runs the app.
If you want to just run the app, then run the script with the argument "1": ./installAndRun.sh 1.
If you want to build and run the app on a Docker Container, then run the script with the argument "0" followed by the argument "1": ./installAndRun.sh 0 1.
Additionally, if you want to test/visualize the exposed metrics on Prometheus and Grafana, you can deploy their instances on docker containers, by enabling the "runPrometheusAndGrafanaContainers" switch, inside the "./installAndRun.sh" script.
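
The three ways of invoking the script described above can be summarised in a short Python sketch. The mode names and both helper functions are hypothetical and exist only for illustration; the argument lists are the ones given in the steps above.

```python
import subprocess
from typing import List

# Argument lists as described in the steps above.
# The mode names (dictionary keys) are illustrative, not part of the project.
MODES = {
    "build_and_run": [],       # ./installAndRun.sh
    "run_only": ["1"],         # ./installAndRun.sh 1
    "docker": ["0", "1"],      # ./installAndRun.sh 0 1
}


def install_and_run_command(mode: str) -> List[str]:
    """Return the command line for the requested (hypothetical) mode name."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["./installAndRun.sh", *MODES[mode]]


def launch(mode: str, repo_dir: str = "UrlsController") -> subprocess.CompletedProcess:
    """Run the script from inside the cloned repository (assumes a POSIX shell environment)."""
    return subprocess.run(install_and_run_command(mode), cwd=repo_dir, check=True)
```

For example, install_and_run_command("docker") yields the command for building and running the app in a Docker container.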

BulkImport API:

"bulkImportFullTexts" endpoint: http://<IP>:<PORT>/api/bulkImportFullTexts?provenance=<provenance>&bulkImportDir=<bulkImportDir>&shouldDeleteFilesOnFinish={true|false}
This endpoint loads the right configuration with the help of the "provenance" parameter, delegates the processing to a background thread and immediately returns a message with useful information, including the "reportFileID", which can be used at any moment to request a report about the progress of the bulk-import procedure. Use the HTTP POST method to access the endpoint.
The processing job starts running after 30-60 minutes and processes the full-text files inside the given directory in the following way: it generates the OpenAIRE-IDs, uploads the files to the S3 Object Store, and generates and stores the "payload" records in the database. If requested by the user, it removes the successfully imported full-texts from the directory.
"getBulkImportReport" endpoint: http://<IP>:<PORT>/api/getBulkImportReport?id=<reportFileID>
This endpoint returns the bulkImport report, which corresponds to the given reportFileID, in JSON format.
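
As a rough illustration of how a client might call the two endpoints above, here is a minimal Python sketch that uses only the standard library. The base URL, the provenance value and the directory in the usage note are hypothetical. The format of the message returned by "bulkImportFullTexts" is not described here, so the sketch returns it as raw text, and it assumes that "getBulkImportReport" is reachable with an HTTP GET.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def bulk_import_request(base_url, provenance, bulk_import_dir, delete_on_finish=False):
    """Build the POST request for the "bulkImportFullTexts" endpoint."""
    query = urlencode({
        "provenance": provenance,
        "bulkImportDir": bulk_import_dir,
        "shouldDeleteFilesOnFinish": "true" if delete_on_finish else "false",
    })
    return Request(f"{base_url}/api/bulkImportFullTexts?{query}", method="POST")


def bulk_import_report_url(base_url, report_file_id):
    """Build the URL of the "getBulkImportReport" endpoint."""
    return f"{base_url}/api/getBulkImportReport?{urlencode({'id': report_file_id})}"


def submit_bulk_import(request):
    """Send the request and return the Controller's message as raw text.

    Extracting the "reportFileID" from the message is left to the caller,
    since the message format is not specified in this document.
    """
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def fetch_bulk_import_report(base_url, report_file_id):
    """Fetch and decode the JSON report (assumes the endpoint accepts GET)."""
    with urlopen(bulk_import_report_url(base_url, report_file_id)) as response:
        return json.load(response)
```

A call such as bulk_import_request("http://127.0.0.1:8080", "arXiv", "/data/arxiv", True) only builds the request; nothing is sent until submit_bulk_import() is called with it.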

How to add a bulk-import datasource:

Open the application.yml file.
Add a new object under the "bulk-import.bulkImportSources" property.
Read the comments written at the end of the "bulk-import" property and make sure all requirements are met.

Statistics API:

"getNumberOfAllPayloads" endpoint: http://<IP>:<PORT>/api/stats/getNumberOfAllPayloads
This endpoint returns the total number of payloads existing in the database, independently of the way they were aggregated. This includes the payloads created by other pieces of software, before the PDF-Aggregation-Service was created.
"getNumberOfPayloadsAggregatedByServiceThroughCrawling" endpoint: http://<IP>:<PORT>/api/stats/getNumberOfPayloadsAggregatedByServiceThroughCrawling
This endpoint returns the number of payloads aggregated by the PDF-Aggregation-Service itself, through crawling.
"getNumberOfPayloadsAggregatedByServiceThroughBulkImport" endpoint: http://<IP>:<PORT>/api/stats/getNumberOfPayloadsAggregatedByServiceThroughBulkImport
This endpoint returns the number of payloads aggregated by the PDF-Aggregation-Service itself, through bulk-import procedures, from compatible datasources.
"getNumberOfPayloadsAggregatedByService" endpoint: http://<IP>:<PORT>/api/stats/getNumberOfPayloadsAggregatedByService
This endpoint returns the number of payloads aggregated by the PDF-Aggregation-Service itself, both through crawling and bulk-import procedures.
"getNumberOfLegacyPayloads" endpoint: http://<IP>:<PORT>/api/stats/getNumberOfLegacyPayloads
This endpoint returns the number of payloads which were aggregated by methods other than the PDF Aggregation Service.
"getNumberOfPayloadsForDatasource" endpoint: http://<IP>:<PORT>/api/stats/getNumberOfPayloadsForDatasource?datasourceId=<givenDatasourceId>
This endpoint returns the number of payloads which belong to the datasource specified by the given datasourceID.
"getNumberOfRecordsInspectedByServiceThroughCrawling" endpoint: http://<IP>:<PORT>/api/stats/getNumberOfRecordsInspectedByServiceThroughCrawling
This endpoint returns the number of records inspected by the PDF-Aggregation-Service through crawling.
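
The statistics endpoints share one URL pattern, which a small Python helper can capture. This is only a sketch: the base URL and datasource ID in the usage note are hypothetical, and the HTTP method and response format are not stated above, so the fetch helper assumes a plain GET and returns the body as raw text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint names as listed above.
STATS_ENDPOINTS = (
    "getNumberOfAllPayloads",
    "getNumberOfPayloadsAggregatedByServiceThroughCrawling",
    "getNumberOfPayloadsAggregatedByServiceThroughBulkImport",
    "getNumberOfPayloadsAggregatedByService",
    "getNumberOfLegacyPayloads",
    "getNumberOfPayloadsForDatasource",
    "getNumberOfRecordsInspectedByServiceThroughCrawling",
)


def stats_url(base_url, endpoint, datasource_id=None):
    """Build the URL of one statistics endpoint."""
    if endpoint not in STATS_ENDPOINTS:
        raise ValueError(f"unknown statistics endpoint: {endpoint!r}")
    url = f"{base_url}/api/stats/{endpoint}"
    if endpoint == "getNumberOfPayloadsForDatasource":
        # Only this endpoint takes a query parameter.
        if datasource_id is None:
            raise ValueError("datasource_id is required for this endpoint")
        url += "?" + urlencode({"datasourceId": datasource_id})
    return url


def fetch_stat(base_url, endpoint, datasource_id=None):
    """Return the raw response body (assumes GET; the body format is not assumed)."""
    with urlopen(stats_url(base_url, endpoint, datasource_id)) as response:
        return response.read().decode("utf-8")
```

For example, stats_url("http://127.0.0.1:8080", "getNumberOfPayloadsForDatasource", "ds_1") appends the "datasourceId" query parameter, while the other endpoints take no parameters.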

Shutdown Service API:

"shutdownService" endpoint: http://localhost:<PORT>/api/shutdownService
This endpoint sends "shutdownWorker" requests to all the Workers which are actively participating in the Service. The Workers shut down after they finish their work-in-progress and all their full-texts have either been transferred to the Controller or, in case of an error, deleted.
Once the Workers are about to shut down, they send a "shutdownReport" to the Controller. A scheduled task runs in the Controller every 2 hours; if the user has specified that the Controller must shut down and all the Workers participating in the Service have shut down, it shuts down the Controller gracefully.
"cancelShutdownService" endpoint: http://localhost:<PORT>/api/cancelShutdownService
This endpoint specifies that the Controller will not shut down, and sends "cancelShutdownWorker" requests to all the Workers which are actively participating in the Service (have not shut down yet), so that they can continue to request assignments.
"shutdownAllWorkers" endpoint: http://localhost:<PORT>/api/shutdownAllWorkers
This endpoint sends "shutdownWorker" requests to all the Workers which are actively participating in the Service. The Workers shut down after they finish their work-in-progress and all their full-texts have either been transferred to the Controller or, in case of an error, deleted.
Once the Workers are about to shut down, they send a "shutdownReport" to the Controller.
This endpoint is helpful when we want to update only the Workers, while keeping the Service running for Bulk-import procedures.
"cancelShutdownAllWorkers" endpoint: http://localhost:<PORT>/api/cancelShutdownAllWorkers
This endpoint specifies that the Workers will not shut down, and sends "cancelShutdownWorker" requests to all the Workers which are actively participating in the Service (have not shut down yet), so that they can continue to request assignments.

Notes:

The Shutdown Service API is accessible only from the Controller's host machine.
Use the HTTP POST method to access the endpoints.
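
A minimal Python sketch of calling these endpoints from the Controller's host machine follows. The helper names and the port in the usage note are hypothetical, and the content of the response is not described above, so it is returned as raw text.

```python
from urllib.request import Request, urlopen

# Endpoint names as listed above.
SHUTDOWN_ENDPOINTS = (
    "shutdownService",
    "cancelShutdownService",
    "shutdownAllWorkers",
    "cancelShutdownAllWorkers",
)


def shutdown_request(endpoint, port):
    """Build the POST request for one Shutdown Service endpoint on localhost."""
    if endpoint not in SHUTDOWN_ENDPOINTS:
        raise ValueError(f"unknown shutdown endpoint: {endpoint!r}")
    return Request(f"http://localhost:{port}/api/{endpoint}", method="POST")


def send(request):
    """Send the request and return the response body as raw text."""
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, send(shutdown_request("shutdownAllWorkers", 8080)) would ask all Workers to shut down while the Controller keeps running for bulk-import procedures; send(shutdown_request("cancelShutdownAllWorkers", 8080)) would revert that request for the Workers that have not shut down yet.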

Prometheus Metrics:

"numOfAllPayloads"
"numOfPayloadsAggregatedByServiceThroughCrawling"
"numOfPayloadsAggregatedByServiceThroughBulkImport"
"numOfPayloadsAggregatedByService"
"numOfLegacyPayloads"
"numOfRecordsInspectedByServiceThroughCrawling"
"averageFulltextsTransferSizeOfWorkerReports"
"averageSuccessPercentageOfWorkerReports"
"getAssignments_time_seconds_max": Time taken to return the assignments.
"addWorkerReport_time_seconds": Time taken to add the WorkerReport.

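To inspect these metrics outside Grafana, the Prometheus text exposition format can be read with a few lines of Python. The sketch below is a simplified reader, not a full parser: it ignores comment lines, keeps only the last sample when a metric appears with several label sets, and ignores optional timestamps. The sample text and its values are made up for illustration, and the exact names and labels exposed by the Controller may differ.

```python
def parse_metrics(text, wanted):
    """Return {metric_name: value} for the wanted metric names found in the text."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        if "{" in line:
            # Sample with labels: name{label="value",...} value
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Sample without labels: name value
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name in wanted and fields:
            values[name] = float(fields[0])
    return values


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# HELP numOfAllPayloads hypothetical help text
# TYPE numOfAllPayloads gauge
numOfAllPayloads 1200.0
numOfLegacyPayloads 300.0
getAssignments_time_seconds_max{exception="None"} 2.5
"""

selected = parse_metrics(SAMPLE, {"numOfAllPayloads", "getAssignments_time_seconds_max"})
```
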
Implementation notes:

For transferring the full-text files, we use Facebook's Zstandard compression algorithm, which offers a high compression ratio together with fast compression and decompression.
The uploaded full-text files follow this naming-scheme: "datasourceID/recordID::fileHash.pdf"
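
The naming-scheme above can be expressed as a pair of small Python helpers. This is a sketch under two assumptions that are not stated in this document: the datasource ID contains no "/" character, and the file hash contains no "::" sequence. The function names and the IDs in the usage note are hypothetical.

```python
def full_text_object_name(datasource_id, record_id, file_hash):
    """Compose a name following "datasourceID/recordID::fileHash.pdf"."""
    return f"{datasource_id}/{record_id}::{file_hash}.pdf"


def parse_full_text_object_name(name):
    """Split a name back into (datasource_id, record_id, file_hash)."""
    if not name.endswith(".pdf"):
        raise ValueError(f"not a .pdf object name: {name!r}")
    stem = name[: -len(".pdf")]
    # Assumption: the first "/" separates the datasource ID from the rest.
    datasource_id, slash, rest = stem.partition("/")
    # Assumption: the last "::" separates the record ID from the file hash,
    # so a record ID that itself contains "::" is still handled.
    record_id, sep, file_hash = rest.rpartition("::")
    if not slash or not sep:
        raise ValueError(f"name does not follow the naming-scheme: {name!r}")
    return datasource_id, record_id, file_hash
```

For example, full_text_object_name("ds1", "rec1", "abc123") gives "ds1/rec1::abc123.pdf", and parse_full_text_object_name() reverses it.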