
# UrlsController
## [![Jenkins build status](https://jenkins-dnet.d4science.org/buildStatus/icon?job=UrlsController)](https://jenkins-dnet.d4science.org/job/UrlsController/)

The Controller's Application receives requests coming from the [**Workers**](https://code-repo.d4science.org/lsmyrnaios/UrlsWorker) (deployed on the cloud), constructs an assignments-list with data retrieved from a database and returns the list to the Workers.<br>
Then, it receives the "WorkerReports", requests the full-texts from the Workers, in batches, and uploads them to the S3-Object-Store. Finally, it writes the related reports, along with the updated file-locations, into the database.<br>
<br>
It can also process **Bulk-Import** requests, from compatible data sources, in which case it receives the full-text files immediately, without offloading crawling jobs to Workers.<br>
<br>
For managing and generating data, we use [**Impala**](https://impala.apache.org/) JDBC and WebHDFS.<br>
<br>
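
The batched retrieval of full-texts mentioned above can be illustrated with a minimal sketch. It is not the Controller's actual code: the function name `split_into_batches`, the file names and the batch size are all hypothetical, chosen only to show the idea of requesting files in fixed-size groups.

```python
from typing import Iterator, List, Sequence


def split_into_batches(file_names: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` file names."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(file_names), batch_size):
        yield list(file_names[start:start + batch_size])


# Hypothetical file names reported by a single Worker.
reported_files = [f"file_{i}.pdf" for i in range(7)]

# Each batch would correspond to one full-texts request sent to that Worker.
for number, batch in enumerate(split_into_batches(reported_files, batch_size=3), start=1):
    print(f"batch {number}: {batch}")
```

The last batch may be smaller than the others, which is why the slice is bounded by the list length rather than padded.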


**To install and run the application**:
- Run ```git clone``` and then ```cd UrlsController```.
- Set the preferred values inside the [__application.yml__](https://code-repo.d4science.org/lsmyrnaios/UrlsController/src/branch/master/src/main/resources/application.yml) file. Specifically, for tests, set the ***services.pdfaggregation.controller.isTestEnvironment*** property to "**true**" and make sure the "***services.pdfaggregation.controller.db.testDatabaseName***" property is set to a test-database.
- Execute the ```installAndRun.sh``` script which builds and runs the app.<br>
  If you want to just run the app, then run the script with the argument "1": ```./installAndRun.sh 1```.<br>
  If you want to build and run the app in a **Docker Container**, then run the script with the argument "0" followed by the argument "1": ```./installAndRun.sh 0 1```.<br>
  Additionally, if you want to test/visualize the exposed metrics on Prometheus and Grafana, you can deploy their instances in Docker containers,
  by enabling the "runPrometheusAndGrafanaContainers" switch, inside the "./installAndRun.sh" script.<br>
  <br>


**BulkImport API**:
- "**bulkImportFullTexts**" endpoint: **http://\<IP\>:\<PORT\>/api/bulkImportFullTexts?provenance=\<provenance\>&bulkImportDir=\<bulkImportDir\>&shouldDeleteFilesOnFinish={true|false}** <br>
  This endpoint loads the right configuration with the help of the "provenance" parameter, delegates the processing to a background thread and immediately returns a message with useful information, including the "reportFileID", which can be used at any moment to request a report about the progress of the bulk-import procedure.<br>
  The processing job starts running after 30-60 minutes and processes the full-text files inside the given directory, in the following way: it generates the OpenAIRE-IDs, uploads the files to the S3 Object Store, and generates and stores the "payload" records in the database. If requested by the user, it removes the successfully imported full-texts from the directory.
- "**getBulkImportReport**" endpoint: **http://\<IP\>:\<PORT\>/api/getBulkImportReport?id=\<reportFileID\>** <br>
  This endpoint returns the bulkImport report, which corresponds to the given reportFileID, in JSON format.
<br>
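
As a rough illustration of the two BulkImport endpoints, the following sketch only builds the request URLs with properly encoded query parameters. The helper names, the base URL, the provenance value, the directory and the report ID are hypothetical. The README does not state which HTTP method the endpoints expect, so the sketch stops before sending any request.

```python
from urllib.parse import urlencode


def bulk_import_url(base_url: str, provenance: str, bulk_import_dir: str,
                    should_delete_files_on_finish: bool) -> str:
    """Build the "bulkImportFullTexts" URL with URL-encoded query parameters."""
    query = urlencode({
        "provenance": provenance,
        "bulkImportDir": bulk_import_dir,
        # The endpoint expects the literal strings "true" or "false".
        "shouldDeleteFilesOnFinish": "true" if should_delete_files_on_finish else "false",
    })
    return f"{base_url}/api/bulkImportFullTexts?{query}"


def bulk_import_report_url(base_url: str, report_file_id: str) -> str:
    """Build the "getBulkImportReport" URL for a previously returned reportFileID."""
    return f"{base_url}/api/getBulkImportReport?{urlencode({'id': report_file_id})}"


# Hypothetical values, for illustration only.
BASE_URL = "http://127.0.0.1:8080"
print(bulk_import_url(BASE_URL, "exampleProvenance", "/mnt/bulk_import/example/", False))
print(bulk_import_report_url(BASE_URL, "example-report-id"))
```

Encoding the directory path matters because it contains slashes, which `urlencode` turns into `%2F` inside the query string.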

**How to add a bulk-import datasource**:
- Open the [__application.yml__](https://code-repo.d4science.org/lsmyrnaios/UrlsController/src/branch/master/src/main/resources/application.yml) file.
- Add a new object under the "bulk-import.bulkImportSources" property.
- Read the comments written at the end of the "bulk-import" property and make sure all requirements are met.
<br>
<br>

**Statistics API**:
- "**getNumberOfAllPayloads**" endpoint: **http://\<IP\>:\<PORT\>/api/stats/getNumberOfAllPayloads** <br>
  This endpoint returns the total number of payloads existing in the database, independently of the way they were aggregated. This includes the payloads created by other pieces of software, before the PDF-Aggregation-Service was created.
- "**getNumberOfPayloadsAggregatedByServiceThroughCrawling**" endpoint: **http://\<IP\>:\<PORT\>/api/stats/getNumberOfPayloadsAggregatedByServiceThroughCrawling** <br>
  This endpoint returns the number of payloads aggregated by the PDF-Aggregation-Service itself, through crawling.
- "**getNumberOfPayloadsAggregatedByServiceThroughBulkImport**" endpoint: **http://\<IP\>:\<PORT\>/api/stats/getNumberOfPayloadsAggregatedByServiceThroughBulkImport** <br>
  This endpoint returns the number of payloads aggregated by the PDF-Aggregation-Service itself, through bulk-import procedures, from compatible datasources.
- "**getNumberOfPayloadsAggregatedByService**" endpoint: **http://\<IP\>:\<PORT\>/api/stats/getNumberOfPayloadsAggregatedByService** <br>
  This endpoint returns the number of payloads aggregated by the PDF-Aggregation-Service itself, both through crawling and bulk-import procedures.
- "**getNumberOfLegacyPayloads**" endpoint: **http://\<IP\>:\<PORT\>/api/stats/getNumberOfLegacyPayloads** <br>
  This endpoint returns the number of payloads which were aggregated by methods other than the PDF-Aggregation-Service.
- "**getNumberOfPayloadsForDatasource**" endpoint:  **http://\<IP\>:\<PORT\>/api/stats/getNumberOfPayloadsForDatasource?datasourceId=\<givenDatasourceId\>** <br>
  This endpoint returns the number of payloads which belong to the datasource specified by the given datasourceID.
- "**getNumberOfRecordsInspectedByServiceThroughCrawling**" endpoint: **http://\<IP\>:\<PORT\>/api/stats/getNumberOfRecordsInspectedByServiceThroughCrawling** <br>
  This endpoint returns the number of records inspected by the PDF-Aggregation-Service through crawling.
<br>
<br>
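
The Statistics endpoints listed above share one URL pattern, which the following sketch captures. The helper name `stats_url`, the base URL and the datasource ID are hypothetical. Only the endpoint names and the "datasourceId" parameter come from this README.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint names exactly as listed in the "Statistics API" section.
STATS_ENDPOINTS = (
    "getNumberOfAllPayloads",
    "getNumberOfPayloadsAggregatedByServiceThroughCrawling",
    "getNumberOfPayloadsAggregatedByServiceThroughBulkImport",
    "getNumberOfPayloadsAggregatedByService",
    "getNumberOfLegacyPayloads",
    "getNumberOfPayloadsForDatasource",
    "getNumberOfRecordsInspectedByServiceThroughCrawling",
)


def stats_url(base_url: str, endpoint: str, datasource_id: Optional[str] = None) -> str:
    """Build the URL of a Statistics endpoint, adding "datasourceId" where required."""
    if endpoint not in STATS_ENDPOINTS:
        raise ValueError(f"Unknown stats endpoint: {endpoint}")
    url = f"{base_url}/api/stats/{endpoint}"
    if endpoint == "getNumberOfPayloadsForDatasource":
        if not datasource_id:
            raise ValueError("This endpoint requires a datasource ID")
        url += "?" + urlencode({"datasourceId": datasource_id})
    return url


# Hypothetical base URL and datasource ID.
BASE_URL = "http://127.0.0.1:8080"
for name in STATS_ENDPOINTS:
    ds = "example-datasource-id" if name == "getNumberOfPayloadsForDatasource" else None
    print(stats_url(BASE_URL, name, ds))
```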

**Shutdown Service API**:
- "**shutdownService**" endpoint: **http://localhost:\<PORT\>/api/shutdownService** <br>
  This endpoint sends "shutdownWorker" requests to all the Workers which are actively participating in the Service. The Workers will shut down after they finish their work-in-progress and all full-texts have been either transferred to the Controller or deleted, in case of an error.<br>
  Once the Workers are about to shut down, they send a "shutdownReport" to the Controller. A scheduled task runs in the Controller every 2 hours; if the user has specified that the Controller must shut down and all the Workers participating in the Service have shut down, then it gracefully shuts down the Controller.
- "**cancelShutdownService**" endpoint: **http://localhost:\<PORT\>/api/cancelShutdownService** <br>
  This endpoint specifies that the Controller will not shut down, and sends "cancelShutdownWorker" requests to all the Workers which are actively participating in the Service (have not shut down yet), so that they can continue to request assignments.
<br>

Note: The Shutdown Service API is accessible only from the Controller's host machine.
<br>
<br>
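
The condition checked by the periodic task described above can be sketched as follows. This is an illustration of the stated rule only, not the Controller's code: the `WorkerState` class, its fields and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WorkerState:
    worker_id: str
    # True once the Worker's "shutdownReport" has been received.
    has_shutdown: bool


def should_shut_down_controller(shutdown_requested: bool,
                                workers: Iterable[WorkerState]) -> bool:
    """Return True only if a shutdown was requested and every Worker has shut down."""
    if not shutdown_requested:
        # Also covers the case where "cancelShutdownService" reset the flag.
        return False
    # Note: with no participating Workers, all() is True for an empty input.
    return all(worker.has_shutdown for worker in workers)


# Hypothetical Worker states.
workers = [WorkerState("worker-1", True), WorkerState("worker-2", False)]
print(should_shut_down_controller(True, workers))   # one Worker is still running
workers[1].has_shutdown = True
print(should_shut_down_controller(True, workers))   # all Workers have reported
```

Because the check runs on a schedule, the Controller may stay up for some time after the last "shutdownReport" arrives.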


**Prometheus Metrics**:
- "**numOfAllPayloads**"
- "**numOfPayloadsAggregatedByServiceThroughCrawling**"
- "**numOfPayloadsAggregatedByServiceThroughBulkImport**"
- "**numOfPayloadsAggregatedByService**"
- "**numOfLegacyPayloads**"
- "**numOfRecordsInspectedByServiceThroughCrawling**"
- "**getAssignments_time_seconds_max**": Time taken to return the assignments.
- "**addWorkerReport_time_seconds**": Time taken to add the WorkerReport.
<br>
<br>
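
To inspect these metrics outside Grafana, one option is to read them from the Prometheus text exposition output. The sketch below assumes that format in its simplest form (one `name{labels} value` sample per line, no timestamps). The sample text, its values and its label are made up, and the function name `read_metrics` is hypothetical.

```python
from typing import Dict, Iterable, List


def read_metrics(exposition_text: str, wanted_names: Iterable[str]) -> Dict[str, List[float]]:
    """Collect the sample values of the wanted metrics from exposition-format text."""
    wanted = set(wanted_names)
    samples: Dict[str, List[float]] = {}
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes no trailing timestamp, so the last token is the value.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0].strip()
        if name in wanted:
            samples.setdefault(name, []).append(float(value))
    return samples


# Made-up sample output, for illustration only.
SAMPLE = """\
# TYPE numOfAllPayloads gauge
numOfAllPayloads 1500.0
numOfLegacyPayloads 1000.0
getAssignments_time_seconds_max{exception="None"} 2.5
"""

print(read_metrics(SAMPLE, ["numOfAllPayloads", "getAssignments_time_seconds_max"]))
```

A metric may appear several times with different labels, which is why each name maps to a list of values.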

**Implementation notes**:
- For transferring the full-text files, we use Facebook's [**Zstandard**](https://facebook.github.io/zstd/) compression algorithm, which brings significant benefits in compression rate and speed.
- The uploaded full-text files follow this naming-scheme: "**datasourceID/recordID::fileHash.pdf**"
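<br>

The naming-scheme above can be illustrated with a small sketch that builds and parses such a file name. The function names and the example IDs are hypothetical. The parser additionally assumes that the datasource ID contains no "/" and that the file hash contains no "::", which this README does not state.

```python
from typing import Tuple


def build_object_key(datasource_id: str, record_id: str, file_hash: str) -> str:
    """Compose a name following "datasourceID/recordID::fileHash.pdf"."""
    return f"{datasource_id}/{record_id}::{file_hash}.pdf"


def parse_object_key(key: str) -> Tuple[str, str, str]:
    """Split a name back into (datasource_id, record_id, file_hash)."""
    suffix = ".pdf"
    if not key.endswith(suffix):
        raise ValueError(f"Not a PDF file name: {key}")
    stem = key[:-len(suffix)]
    # Assumption: the first "/" ends the datasource ID.
    datasource_id, slash, remainder = stem.partition("/")
    # Assumption: the last "::" starts the file hash, so IDs may contain "::".
    record_id, separator, file_hash = remainder.rpartition("::")
    if not (slash and separator and datasource_id and record_id and file_hash):
        raise ValueError(f"Unexpected file name layout: {key}")
    return datasource_id, record_id, file_hash


# Hypothetical IDs and hash.
key = build_object_key("exampleDatasource", "exampleRecord", "0123abcd")
print(key)
print(parse_object_key(key))
```

Splitting on the last "::" rather than the first keeps the parser usable when the record ID itself contains that separator.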