e2d43a9af0Upgrade the "processBulkImportedFilesSegment" code: 1) Pre-calculate the file-hashes for all files of the segment and perform a single "getHashLocationsQuery", instead of thousands 2) Write some important events to the bulkImportReport file, as soon as they are added in the list.
master
Lampros Smyrnaios2024-04-02 14:35:19 +0300
bd323ad69aAvoid a very rare case, where we might get an "IllegalArgumentException" from "Lists.partition()", in case the "sizeOfUrlReports" is <= 3.Lampros Smyrnaios2024-03-29 18:12:52 +0200
08de530f03Various improvements: - Handle the case when "fileUtils.constructS3FilenameAndUploadToS3()" returns "null", in "processBulkImportedFile()". - Avoid an "IllegalArgumentException" in "Lists.partition()" when the number of files to bulkImport are fewer than the number of threads available to handle them. - Include the last directory's "/" divider in the fileDIR group of "FILEPATH_ID_EXTENSION" regex (renamed from "FILENAME_ID_EXTENSION"). - Fix an incomplete log-message. - Provide the "fileLocation" argument in the "DocFileData" constructor, in "processBulkImportedFile()", even though it's not used after.Lampros Smyrnaios2024-03-29 17:23:01 +0200
1d821ed803- Prepare version for next release. - Fix typo of not using the "OpenAireID" in the S3 location of bulkImported files. Instead, the "fileNameID" was used, which in aggregation is the OpenAireID, but not in bulk-import. - Update dependencies. - Code polishing.Lampros Smyrnaios2024-03-28 06:09:28 +0200
8bc5cc35e2- Optimize writing to the Bulk-import-report file. - Show the IP of the worker which posts a "workerShutdownReport". - Code polishing.Lampros Smyrnaios2024-03-22 17:50:55 +0200
b9b29dd51cMove some code from "FileUtils.getAndUploadFullTexts()" to two separate methods.Lampros Smyrnaios2024-03-20 16:53:03 +0200
56d233d38e- Move the "FileUtils.mergeParquetFiles()" method to "ParquetFileUtils.mergeParquetFilesOfTable()". - Fix a typo.Lampros Smyrnaios2024-03-20 15:25:19 +0200
724eae1514- Optimize the placement of "DatabaseConnector.databaseLock.unlock()" statements. - Rename a maven-repository.Lampros Smyrnaios2024-03-20 15:08:01 +0200
9b0818b535- Add handling for additional/specific exceptions, when checking the "futures". - Move common "ExecutionException" handling-code into its own method: "GenericUtils.getSelectedStackTraceForCausedException()". - Avoid a double log. - Code polishing.Lampros Smyrnaios2024-03-14 13:59:23 +0200
b34417dc45Optimize the test-DB creation process: - Use views of the "initialDatabase" view and tables to a) reduce the amount of space used by test-DBs and b) improve test-db creation performance. - Avoid possible failures from outdated metadata.Lampros Smyrnaios2024-03-14 13:10:54 +0200
f61cae41a1- Try to get the cause of the exception of the callable-tasks which handle the bulk-import of fileSegments. - Fix not counting the failedSegments when an exception was thrown. - Code polishing.Lampros Smyrnaios2024-03-13 12:15:59 +0200
8f9786de09Upgrade the algorithm for finding the previously-found fulltexts, based on their md5hash: - Use a single query with a list of the fileHashes, instead of thousands of single-md5hash-check queries (run at most 6 in parallel) which require a lot of I/O. - Avoid checking the same fileHash multiple times, in case it is related to multiple payloads. - In case of a database-error, avoid completely losing the full-texts of that worker; instead, continue processing the full-texts.Lampros Smyrnaios2024-03-13 11:28:37 +0200
e4540e7f3cHandle the case when a urlReports-sublist does not have any payloads inside.Lampros Smyrnaios2024-03-12 14:25:00 +0200
e20c5d2146- Add error-handling for the case when no payloads could be associated with a specific url which should have been in the hashMultiMap in "addUrlReportsByMatchingRecordsFromBacklog". - Fix not cloning the payload, before changing it and adding it in the "prefilledPayloads"-list; instead, an object-reference was used.Lampros Smyrnaios2024-03-11 19:48:04 +0200
1048463ca0- Improve error-handling in "S3ObjectStore.emptyBucket()". - Change some log-levels. - Code polishing.Lampros Smyrnaios2024-03-11 16:17:32 +0200
8f18008001Avoid performing payload-related operations in case no fulltext was received from the worker, due to an error.Lampros Smyrnaios2024-03-11 14:57:13 +0200
ce3e149a95Improve the "emptying/deleting" process of the S3-bucket.Lampros Smyrnaios2024-03-11 13:34:38 +0200
dd394f18a0- Optimize the JOIN-order in the "findAssignmentsQuery". - Optimize the "DOC_URL_FILTER"-regex. - Update dependencies.Lampros Smyrnaios2024-03-11 11:35:38 +0200
43ea64758d- Improve handling of the case when no fulltexts have been found or none of the found ones were requested from the worker, as they were already retrieved in the past. - Show the number of files with problematic locations (if any of them exist). - Code polishing.Lampros Smyrnaios2024-02-23 12:39:28 +0200
b72996c9a9- Configure the destination of the logs in the "application.properties" file. - Add some gradle files to be used by Jenkins.Lampros Smyrnaios2024-02-08 19:47:34 +0200
3563fd6e2a- Try to get the cause of the exception of the callable-tasks which handle parquet-files. - Update License. - Update dependencies.Lampros Smyrnaios2024-02-07 18:34:28 +0200
5dadb8ad2f- Optimize the "DOC_URL_FILTER"-regex, by using a non-capturing group. - Remove an extra "File.separator" from the fulltexts-fullFilePath.Lampros Smyrnaios2024-01-19 15:46:23 +0200
bdc61c2cdaWhen at least one worker is still active and we have to wait for service-shutdown, show a log-message to inform the user, including that worker's IP.
2.6.2
Lampros Smyrnaios2024-01-15 13:35:22 +0200
3a70b57146Prioritize the full-text urls over the landing-page ones.Lampros Smyrnaios2024-01-15 12:59:50 +0200
ee1ca8966b- Avoid continuing to request workerReport-batches when from the 1st batch, the base-directory of that assignments-counter is not found. - Update dependencies.Lampros Smyrnaios2024-01-15 12:57:33 +0200
2e60128084- Allow easily changing the port used by workers. - Show the number of active background-tasks and bulkImportDirs, which delay the Service's shutdown. - Update dependencies. - Code polishing.
2.6.1
Lampros Smyrnaios2023-12-19 23:31:42 +0200
d90ad51609Add the "shutdownAllWorkersGracefully" and "cancelShutdownAllWorkersGracefully" endpoints, in order to be able to shut them down at once and update them, without shutting down the whole Service. So in this case the bulk-import procedures will continue to work.Lampros Smyrnaios2023-11-29 16:45:58 +0200
d20c9a7d2e- Show the original exception thrown by the background-job, not the one thrown in the main-thread, which is useless, except for its message. - Reduce the interval for deleting the unhandled assignments to once every 3 days. - Set the upcoming version. - Update dependencies.Lampros Smyrnaios2023-11-27 18:19:53 +0200
7f789b8ad0- If we receive an "UnknownHostException" when uploading to the S3ObjectStore, then skip the current full-texts' batch to leave some time for the network to get unstuck. - Code polishing.Lampros Smyrnaios2023-11-22 15:29:18 +0200
9b1f2c4931Improve performance and reduce memory usage of the "findAssignmentsQuery": - Reorder JOINs and predicates to reduce the computational cost. - Remove the memory-costly "pu.url" predicates from the "where" clause, as the DB has no empty urls anymore.Lampros Smyrnaios2023-10-31 15:59:48 +0200
db929d8931- Add a scheduling job to delete assignments older than 7 days. These may be left behind when the worker throws a "SocketTimeoutException" before it can receive the assignments and process them. No workerReport gets created for those assignments. - Improve some log-messages. - Code polishing.Lampros Smyrnaios2023-10-30 12:29:54 +0200
856c62887d- Make sure the "UTF_8" charset is used, when we get a message from the response-body. - Improve some log-messages.Lampros Smyrnaios2023-10-26 11:44:23 +0300
bdf834c439- Refactor the "shutdown" script to do an orderly-shutdown, by default, by calling the "shutdownService" endpoint. In case a "force-shutdown" is needed, that can be requested with a cmd-argument. - Fix not updating the "UrlsController.numOfWorkers" correctly. - Code polishing.Lampros Smyrnaios2023-10-23 17:19:29 +0300
0c7bf6357b- Improve performance in "FileUtils.addUrlReportsByMatchingRecordsFromBacklog()". - Make sure we remove the assignments of all "not-successful", old, worker-reports, even for the ones which failed to be renamed to indicate success or failure, or failed to be executed by the background threads (and thus never reached the renaming stage).
2.6
Lampros Smyrnaios2023-10-23 12:21:42 +0300
a7581335f1- Improve the "getDataForPayloadPrefillQuery". - Improve some error-messages.Lampros Smyrnaios2023-10-21 11:31:31 +0300
44c2fe7418- Fix the "IndexOutOfBoundsException", when checking the futures' results. - Update dependencies.Lampros Smyrnaios2023-10-20 14:25:05 +0300
df0ea62a5a- Handle the case when the "webHDFSBaseUrl" does not use HTTPS. - Improve error-reporting when uploading a file to HDFS.Lampros Smyrnaios2023-10-19 11:59:37 +0300
40729c6295Move similar code into the new "ParquetFileUtils.getPayloadParquetRecord()" method.Lampros Smyrnaios2023-10-17 12:50:51 +0300
fb2877dbe8Upgrade the execution system for the backgroundTasks: - Submit each task immediately for execution, instead of waiting for a scheduling thread to send all gathered tasks (up to that point) to the ExecutorService (and block until they are finished, before it can start again). - Hold the Future of each submitted task in a synchronized-list to check the result of each task at a scheduled time. - Reduce the cpu-time to assure the Service can shut down, by checking if there are "actively" and "about-to-be-executed" tasks, at the same time, instead of having to rely on the additional checking of the "shutdown"-status of each worker to verify that no active tasks exist. - Improve the threads' shutdown procedure.Lampros Smyrnaios2023-10-09 17:23:59 +0300
0c79fdea35Update the "findAssignmentsQuery" to check the "attempt.error_class" field for the current pub_url, not the pub_id.Lampros Smyrnaios2023-10-06 14:59:26 +0300
ebf8896005- Fix getter and setter methods for the "isAuthoritative" field. - Update Gradle.Lampros Smyrnaios2023-10-05 16:31:52 +0300
b2ce6393c1- Add check for remaining "bulkImportDirsUnderProcessing", before shutting down the Service. - Code polishing.Lampros Smyrnaios2023-10-05 13:43:47 +0300
96c11ba4b8- Add a missing change. - Code optimization and polishing. - Update dependencies.Lampros Smyrnaios2023-10-04 16:17:12 +0300
7019f7c3c7Improve aggregation speed, by generating additional "attempt" and "payload" records for the publications which are in the back-log and their url matches to one of the urls of the current payloads.Lampros Smyrnaios2023-10-04 15:43:31 +0300
b702cf4484Upgrade the "findAssignmentsQuery": - Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allows us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), b) avoid aggregating urls which are already inside the "payload" or "assignment" tables, even when they are related to other IDs. In the end, we only care about the urls when choosing which records should be aggregated. - Improve performance by using the "anti join" operator, where it fits, in order to allow the engine to use the faster "hash" operations.Lampros Smyrnaios2023-10-04 13:43:15 +0300
c9626de120Handle the case when the "upload-file-to-S3" operation fails with a "ConnectException". In this case, all remaining upload operations for the files of that particular batch or segment, are canceled.Lampros Smyrnaios2023-10-04 13:01:13 +0300
865926fbc3- Handle the case when some results have been found from the "getAssignmentsQuery", but no data could be extracted from them. - Code polishing.Lampros Smyrnaios2023-10-02 15:46:55 +0300
ede7ca5a89- Add bulk-import support for non-Authoritative data-sources. - Update Spring Boot. - Code polishing.Lampros Smyrnaios2023-09-26 18:01:55 +0300
360731ba72- Improve handling of the "NO_CONTENT" case, in "getAssignments"-endpoint. - Code optimization and polishing.Lampros Smyrnaios2023-09-14 13:53:01 +0300
02bae38885- Improve response-time to "getAssignments"-requests, by avoiding merging the parquet files of the "assignment" table, right after acquiring the assignments from the DB. They are already getting merged, when each assignments-batch is deleted after a workerReport has been processed. - Optimize code-positioning for unlocking the DB when done executing queries.Lampros Smyrnaios2023-09-13 17:03:11 +0300
8fdb8e9137Add renaming of the workerReport-file, to indicate failure, when the processing failed because no workerInfo was found for the worker-id existing in the report. This way, it can be retried by the scheduler later.Lampros Smyrnaios2023-09-13 16:35:41 +0300
c98e8df323Move the "getRenamedWorkerReport"-code in its own method.Lampros Smyrnaios2023-09-13 16:27:18 +0300
6891c467d4- Avoid displaying a warning for the "test" HDFS directory, when the Controller is running in PROD mode. - Add a missing change for the optimization of reading files. - Update dependencies.Lampros Smyrnaios2023-09-13 15:29:30 +0300
3dd349dd00Improve the "findAssignmentsQuery": - Fix an issue, where assignments, having an above-zero attempt_count, were finding their way to the results, just because they were prioritized based on their boost_level or pub_year. Apart from retrying the old failed assignments sooner, the not-yet-processed boosted-publications were pushed out to the workers much slower. - Simplify the query, by removing the internal "ordering" and "limit", which had performance benefits when we did not need additional ordering for "level" and "pub_year". Back then, we wanted to apply the final orderings to as few rows as possible.Lampros Smyrnaios2023-09-13 14:38:15 +0300
ee2df19ce1- Allow "pretty-printing" the json response of the "getBulkImportReport" endpoint. - Add useful log-messages for various bulk-import stages and improve the current ones. - Optimize reading and writing the reports.Lampros Smyrnaios2023-09-11 17:24:39 +0300
1c8f3765ca- Fix not acquiring the full workerReport when retrying it, with the scheduler. - Improve error-handling in the "inspectWorkerReportsAndTakeAction" process. - Code polishing.Lampros Smyrnaios2023-09-08 14:59:48 +0300
e72a4d3d10- Improve handling of already renamed workerReport-files, which relate to failed workerReports in the 1st try. - Add check for workerReport-files, which may have been deleted before their time, due to an error.Lampros Smyrnaios2023-09-08 14:11:41 +0300
bd9245cc3dAvoid deleting the assignment-records in case of a "parquet-data creation, upload or insertion" problem, in order to avoid double-processing of the urls, until the report has been retried by the scheduler.Lampros Smyrnaios2023-09-08 13:44:24 +0300
718f5cfefb- Improve prioritization of the most recent publications. - Avoid processing publications which will be published in the next 5 years, counting from each "current" year, since they are not providing full-texts yet. Still allow the invalid publication-years like "2566", "9999", etc.Lampros Smyrnaios2023-09-07 14:05:58 +0300
199105f7f1Fix not writing some bulk-import error-messages to the logs. Instead, they were only written to the json-reports.Lampros Smyrnaios2023-09-04 16:33:27 +0300
acef891167Improve prioritization of "publication_boost" records, by adding a second ordering in the end.Lampros Smyrnaios2023-09-04 15:34:37 +0300
febe2b212cUpgrade management of failed workerReports: - Upon completing processing a workerReport, the name of the json-file will be appended with "successful" or "failed". - Avoid deleting immediately the failed workerReports. - Add a scheduling task to process leftover failed workerReports from previous executions of the service, only once, 12 hours after startup, in order for the workers to have participated and filled the "workersInfoMap". - Add a scheduling task to process leftover failed workerReports from the current execution, regularly. - Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports. - Fix not deleting the assignment records from the DB, when a failed leftover workerReport gets deleted. - Code refactoring.
2.3
Lampros Smyrnaios2023-09-01 15:10:58 +0300
5c459a3a16Optimize handling of HTTP-4XX errors in "UrlsServiceImpl.postReportResultToWorker()".Lampros Smyrnaios2023-08-31 13:20:12 +0300
601776e81c- Fix not handling the case where the info about the worker in the WorkerReport does not exist inside the "workersInfoMap", as that worker is not participating in the Service. (this case may appear in future code) - Code polishing.Lampros Smyrnaios2023-08-30 17:07:51 +0300
c32dfa882eFix not deleting the assignment-records, for every workerReport, after processing it.Lampros Smyrnaios2023-08-30 16:22:58 +0300
aa3f32f3da- Make sure the number of threads given by the user is above zero. - Adjust the number and size of log files. - Update Spring Boot. - Code polishing.Lampros Smyrnaios2023-08-30 14:02:54 +0300
b3e0d214fdUpdate the BulkImport API: - Refactor the "bulkImportReportID". - Add the "bulk:" prefix in the provenance value, in the DB. - Fix not using correctly the "Lists.partition()" method. - Make sure the "bulkImportDir" is removed from the "bulkImportDirsUnderProcessing" Set, in case of an early-error. - Fix the "numFailedSegments"-calculation. - Improve some messages. - Code polishing.Lampros Smyrnaios2023-08-21 18:19:53 +0300
a524375656- Create the HDFS-subDirs before generating "callableTasks" for creating and uploading the parquetFiles. - Delete gradle .zip file after installation.Lampros Smyrnaios2023-08-04 15:30:41 +0300
860c73ea91- Improve the "shutdownController.sh" script. - Set names for the Prometheus and Grafana containers. - Code polishing.
2.2
Lampros Smyrnaios2023-07-27 18:27:48 +0300
0699acc999Make sure we use the latest version of the "zstd-jni" library, where the core code for the "ZStandard" compression algorithm is. Apache's "commons-compress" package, which wraps it in file-management code, updates the "zstd-jni" less often.Lampros Smyrnaios2023-07-27 17:42:57 +0300
dfb9c8204eAdd useful messages for missing parameters in Stats API.Lampros Smyrnaios2023-07-25 15:36:54 +0300
b73be6d8daFix the Stats API returning simple numbers as "application/json". Now they are returned as "text/plain".Lampros Smyrnaios2023-07-25 12:03:27 +0300
66a5b3c7daUpdate Bulk-Import API: - Increase the "numOfThreadsPerBulkImportProcedure" to 6. - Fix Bulk import not working from a second-level subdirectory; the report-subDirectory was not created. - Fix not returning the bulk-import-report as "application/json". - Add useful messages for missing parameters. - Change the HTTP-method for the "bulkImportFullTexts" endpoint to "POST". - Show a structured json-response for the "bulkImportFullTexts" endpoint. - Fix uncommon date-format. - Remove single quotes from json-report, since they are returned as bytes, not characters. - Optimize the generation of the json-bulkImport-report.Lampros Smyrnaios2023-07-25 11:59:47 +0300
8d8a387ff2Reduce the waiting time for new background tasks to be scheduled for processing.Lampros Smyrnaios2023-07-24 20:33:56 +0300
d821ae398fImprove performance by applying the merging-procedure for the parquet files of the database tables less often, while keeping the benefits of having a relatively small maximum number of parquet files in search operations.Lampros Smyrnaios2023-07-24 20:28:41 +0300
7dc72e242e- Fix missing changes. - Change the HTTP-method of the renamed "test/uploadParquetFile" endpoint to "POST".Lampros Smyrnaios2023-07-24 19:55:37 +0300
9cbac77c2a- Add check for "shouldShutdownService" before allowing to continue with a bulk-import request. - Add check for remaining background tasks (including bulkImports), before checking if the workers have shut down and then shut down the Service.Lampros Smyrnaios2023-07-21 16:19:00 +0300
cec2531737- Increase the "numOfBackgroundThreads" to 8. - Make the "numOfBackgroundThreads" and "numOfThreadsPerBulkImportProcedure" configurable from the "application.yml" file. - Code polishing.Lampros Smyrnaios2023-07-21 11:45:50 +0300
fd1cf56863- Avoid passing some duplicate publications to the Workers, by post-processing the assignments retrieved from Impala. The assignments may contain duplicate id-url pairs, which have different datasources, since one publication may be connected to multiple datasources. - Code polishing.Lampros Smyrnaios2023-07-19 18:31:24 +0300
b94c35c66e- Fix double active "@Scheduled" annotation for the "ScheduledTasks.updatePrometheusMetrics()" method. - Code polishing.Lampros Smyrnaios2023-07-13 18:32:45 +0300
8dfb58ee63Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement". After that enhancement, each worker could request multiple assignment-batches, before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table, all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So in a new assignments-batch request, the same publications that were already under processing, were delivered to the same or other Workers. Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.Lampros Smyrnaios2023-07-11 17:27:23 +0300
d5c139c410Handle the case where the "stats"-queries are executed while some tables of the DB are in a "merge" state. In this case, the queries fail and the Controller retries up to 10 times.Lampros Smyrnaios2023-07-06 18:29:13 +0300
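
The bounded-retry behaviour described in the last entry above (a query that fails while a table is being merged is retried up to 10 times) might look roughly like the sketch below. It is not the Controller's actual code: the class name "RetryingQuery", the method "runWithRetries" and the pause between attempts are all hypothetical and chosen only for illustration.

```java
import java.util.concurrent.Callable;

// Hypothetical helper; not taken from the project's sources.
class RetryingQuery {

    // Runs the given query, retrying on any exception until it succeeds
    // or "maxAttempts" attempts have been made. The last failure is rethrown.
    static <T> T runWithRetries(Callable<T> query, int maxAttempts, long sleepMillis) throws Exception {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be at least 1");

        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts)
                    Thread.sleep(sleepMillis); // Give the table-merge some time to finish (assumed pause).
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated "stats"-query, which fails twice (as if a table were being merged) and then succeeds.
        int result = runWithRetries(() -> {
            if (++calls[0] < 3)
                throw new IllegalStateException("table is in a merge state");
            return 42;
        }, 10, 10L);
        System.out.println("Got " + result + " after " + calls[0] + " attempts.");
    }
}
```

Retrying on every exception type is a simplification; a real implementation would probably retry only on the specific error raised during a table-merge.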