Lampros Smyrnaios
Improve error-handling in "BulkImportReport.getJsonReport()" and "FileUtils.writeToFile()".
2024-05-30 11:52:04 +03:00
Lampros Smyrnaios
- Tighten the thread-safety protection on the "BulkImportReport.getJsonReport()" method.
- Update dependencies.
- Code polishing.
2024-05-27 10:40:05 +03:00
Lampros Smyrnaios
- Reduce the disk space occupied at any given time, by deleting the archives right after decompression and file extraction.
- Code refactoring.
2024-05-22 12:11:39 +03:00
Lampros Smyrnaios
- Fix the bug which prevented the user from using the "shutdownAllWorkersGracefully" endpoint twice.
- Code optimization.
- Update dependencies.
2024-05-21 23:43:49 +03:00
Lampros Smyrnaios
- Resolve a concurrency issue, by enforcing synchronization on the "BulkImportReport.getJsonReport()" method.
- Increase the number of stacktrace-lines to 20, for bulkImport-segment-failures.
- Improve "GenericUtils.getSelectiveStackTrace()".
2024-05-01 01:29:25 +03:00
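Note on the synchronization change above: a minimal sketch of the idea, assuming the report keeps its events in a plain list that several bulk-import threads append to. The class "ReportSketch" and its fields are hypothetical and only illustrate the locking pattern, not the actual "BulkImportReport" code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a report object shared between bulk-import threads.
class ReportSketch {
    private final List<String> events = new ArrayList<>();

    // Writers and the reader lock on the same monitor ("this"),
    // so the list is never serialized while another thread appends to it.
    public synchronized void addEvent(String event) {
        events.add(event);
    }

    public synchronized String getJsonReport() {
        StringBuilder sb = new StringBuilder("{\"events\":[");
        for (int i = 0; i < events.size(); i++) {
            if (i > 0) sb.append(',');
            // Minimal escaping (backslash and double-quote only), enough for this sketch.
            String escaped = events.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append('"').append(escaped).append('"');
        }
        return sb.append("]}").toString();
    }
}
```

Without the shared lock, iterating the list while another thread adds to it can fail with a "ConcurrentModificationException" or produce an inconsistent report.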
Lampros Smyrnaios
- Fix not counting the files from the bulkImport-segment, which failed due to an exception.
- Write segment-exception-messages to the bulkImport-report.
2024-04-30 23:43:52 +03:00
Lampros Smyrnaios
- Code-optimization.
- Upload the updated gradle-wrapper and use the latest Gradle version in the "" script.
2024-04-30 02:13:08 +03:00
Lampros Smyrnaios
- Comment-out the bucket-deletion process, in order to avoid any accidental deletion, even if it has to be explicitly allowed in the config.
- Update dependencies.
2024-04-26 12:54:00 +03:00
Lampros Smyrnaios
Upgrade the "processBulkImportedFilesSegment" code:
1) Pre-calculate the file-hashes for all files of the segment and perform a single "getHashLocationsQuery", instead of thousands.
2) Write some important events to the bulkImportReport file, as soon as they are added to the list.
2024-04-02 14:35:19 +03:00
Lampros Smyrnaios
Avoid a very rare case, where we might get an "IllegalArgumentException" from "Lists.partition()", in case the "sizeOfUrlReports" is <= 3.
2024-03-29 18:12:52 +02:00
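Note on the "Lists.partition()" fix above: Guava's "Lists.partition()" rejects a non-positive partition size, so a size computed by integer division can become 0 for very small lists. The sketch below uses a stdlib stand-in for the Guava method, and the divisor of 4 is only an assumption inferred from the "<= 3" threshold mentioned in the message.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionSketch {
    // Hypothetical divisor; integer division gives 0 for fewer than 4 elements.
    static final int NUM_OF_PARTS = 4;

    // Stdlib stand-in for Guava's Lists.partition(), which also rejects a non-positive size.
    static <T> List<List<T>> partition(List<T> list, int size) {
        if (size <= 0) throw new IllegalArgumentException("size must be positive: " + size);
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < list.size(); i += size)
            parts.add(list.subList(i, Math.min(i + size, list.size())));
        return parts;
    }

    // Clamp the sub-list size to at least 1, so small lists never trigger the exception.
    static <T> List<List<T>> safePartition(List<T> urlReports) {
        int subListSize = Math.max(1, urlReports.size() / NUM_OF_PARTS);
        return partition(urlReports, subListSize);
    }
}
```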
Lampros Smyrnaios
Various improvements:
- Handle the case when "fileUtils.constructS3FilenameAndUploadToS3()" returns "null", in "processBulkImportedFile()".
- Avoid an "IllegalArgumentException" in "Lists.partition()" when the number of files to bulkImport is lower than the number of threads available to handle them.
- Include the last directory's "/" divider in the fileDIR group of "FILEPATH_ID_EXTENSION" regex (renamed from "FILENAME_ID_EXTENSION").
- Fix an incomplete log-message.
- Provide the "fileLocation" argument in the "DocFileData" constructor, in "processBulkImportedFile()", even though it's not used after.
2024-03-29 17:23:01 +02:00
Lampros Smyrnaios
- Prepare version for next release.
- Fix typo of not using the "OpenAireID" in the S3 location of bulkImported files. Instead, the "fileNameID" was used, which in aggregation is the OpenAireID, but not in bulk-import.
- Update dependencies.
- Code polishing.
2024-03-28 06:09:28 +02:00
Lampros Smyrnaios
- Optimize writing to the Bulk-import-report file.
- Show the IP of the worker which posts a "workerShutdownReport".
- Code polishing.
2024-03-22 17:50:55 +02:00
Lampros Smyrnaios
Move some code from "FileUtils.getAndUploadFullTexts()" to two separate methods.
2024-03-20 16:53:03 +02:00
Lampros Smyrnaios
- Move the "FileUtils.mergeParquetFiles()" method to "ParquetFileUtils.mergeParquetFilesOfTable()".
- Fix a typo.
2024-03-20 15:25:19 +02:00
Lampros Smyrnaios
- Optimize the placement of "DatabaseConnector.databaseLock.unlock()" statements.
- Rename a maven-repository.
2024-03-20 15:08:01 +02:00
Lampros Smyrnaios
Update/cleanup the repositories in "build.gradle".
2024-03-15 12:22:13 +02:00
Lampros Smyrnaios
- Add handling for additional/specific exceptions, when checking the "futures".
- Move common "ExecutionException" handling-code into its own method: "GenericUtils.getSelectedStackTraceForCausedException()".
- Avoid a double log.
- Code polishing.
2024-03-14 13:59:23 +02:00
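Note on the "ExecutionException" handling above: "Future.get()" wraps any exception thrown inside the task in an "ExecutionException", so the useful information is in its cause. The following is a minimal sketch of unwrapping it; the class and method names are hypothetical and it does not reproduce "GenericUtils.getSelectedStackTraceForCausedException()".

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FutureCauseSketch {
    // Returns a short description of the exception thrown inside the task, or null on success.
    static String runAndGetFailure(Callable<Boolean> task) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> future = executor.submit(task);
            future.get();
            return null;
        } catch (ExecutionException ee) {
            // The wrapper itself says little; report the original exception from the task.
            Throwable cause = ee.getCause();
            return cause.getClass().getSimpleName() + ": " + cause.getMessage();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            return "InterruptedException: " + ie.getMessage();
        } finally {
            executor.shutdown();
        }
    }
}
```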
Lampros Smyrnaios
Optimize the test-DB creation process:
- Use views of the "initialDatabase" view and tables to a) reduce the amount of space used by test-DBs and b) improve test-db creation performance.
- Avoid possible failures from outdated metadata.
2024-03-14 13:10:54 +02:00
Lampros Smyrnaios
- Try to get the cause of the exception of the callable-tasks which handle the bulk-import of fileSegments.
- Fix not counting the failedSegments when an exception was thrown.
- Code polishing.
2024-03-13 12:15:59 +02:00
Lampros Smyrnaios
Upgrade the algorithm for finding the previously-found fulltexts, based on their md5hash:
- Use a single query with a list of the fileHashes, instead of thousands of single-md5hash-check queries (run at most 6 in parallel) which require a lot of I/O.
- Avoid checking the same fileHash multiple times, in case it is related to multiple payloads.
- In case of a database-error, avoid completely losing the full-texts of that worker; instead, continue processing the full-texts.
2024-03-13 11:28:37 +02:00
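Note on the single-query approach above: a minimal sketch, assuming the hashes are de-duplicated first and then bound to one parameterized "IN (...)" query. The table and column names are placeholders, not the project's actual schema.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

class HashQuerySketch {
    // De-duplicate, so a hash shared by multiple payloads is checked only once.
    static Set<String> uniqueHashes(Collection<String> fileHashes) {
        return new LinkedHashSet<>(fileHashes);
    }

    // Build one query with a "?" placeholder per hash (hypothetical table/column names).
    static String buildHashLocationsQuery(int numOfHashes) {
        if (numOfHashes <= 0) throw new IllegalArgumentException("no hashes to look up");
        String placeholders = String.join(", ", Collections.nCopies(numOfHashes, "?"));
        return "select hash, location from payload where hash in (" + placeholders + ")";
    }
}
```

The caller would then bind the unique hashes to the placeholders in iteration order and read all locations from one result set.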
Lampros Smyrnaios
Handle the case when a urlReports-sublist does not have any payloads inside.
2024-03-12 14:25:00 +02:00
Lampros Smyrnaios
- Add error-handling for the case when no payloads could be associated with a specific url which should have been in the hashMultiMap in "addUrlReportsByMatchingRecordsFromBacklog".
- Fix not cloning the payload, before changing it and adding it in the "prefilledPayloads"-list; instead, an object-reference was used.
2024-03-11 19:48:04 +02:00
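Note on the payload-cloning fix above: adding the same object-reference to a list and then modifying it changes every entry that points to it. The sketch below copies first and modifies only the copy; "PayloadSketch" and its fields are hypothetical and do not mirror the real payload class.

```java
import java.util.ArrayList;
import java.util.List;

class PayloadSketch {
    String id;
    String location;

    PayloadSketch(String id, String location) {
        this.id = id;
        this.location = location;
    }

    // Copy constructor: an independent object with the same field values.
    PayloadSketch(PayloadSketch other) {
        this(other.id, other.location);
    }

    static List<PayloadSketch> prefillFor(PayloadSketch original, List<String> ids) {
        List<PayloadSketch> prefilledPayloads = new ArrayList<>();
        for (String id : ids) {
            PayloadSketch copy = new PayloadSketch(original); // copy first...
            copy.id = id;                                     // ...then change only the copy
            prefilledPayloads.add(copy);
        }
        return prefilledPayloads;
    }
}
```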
Lampros Smyrnaios
- Improve error-handling in "S3ObjectStore.emptyBucket()".
- Change some log-levels.
- Code polishing.
2024-03-11 16:17:32 +02:00
Lampros Smyrnaios
Avoid performing payload-related operations in case no fulltext was received from the worker, due to an error.
2024-03-11 14:57:13 +02:00
Lampros Smyrnaios
Improve the "emptying/deleting" process of the S3-bucket.
2024-03-11 13:34:38 +02:00
Lampros Smyrnaios
- Optimize the JOIN-order in the "findAssignmentsQuery".
- Optimize the "DOC_URL_FILTER"-regex.
- Update dependencies.
2024-03-11 11:35:38 +02:00
Lampros Smyrnaios
- Improve handling of the case when no fulltexts have been found or none of the found ones were requested from the worker, as they were already retrieved in the past.
- Show the number of files with problematic locations (if any of them exist).
- Code polishing.
2024-02-23 12:39:28 +02:00
Lampros Smyrnaios
Add the Jenkins' build-status badge in README.
2024-02-08 19:49:58 +02:00
Lampros Smyrnaios
- Configure the destination of the logs in the "" file.
- Add some gradle files to be used by Jenkins.
2024-02-08 19:47:34 +02:00
Lampros Smyrnaios
- Try to get the cause of the exception of the callable-tasks which handle parquet-files.
- Update License.
- Update dependencies.
2024-02-07 18:34:28 +02:00
Lampros Smyrnaios
Add/improve documentation.
2024-02-01 14:37:29 +02:00
Lampros Smyrnaios
- Optimize the "DOC_URL_FILTER"-regex, by using a non-capturing group.
- Remove an extra "File.separator" from the fulltexts-fullFilePath.
2024-01-19 15:46:23 +02:00
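Note on the non-capturing group above: when a group is used only for alternation and its content is never read back, "(?:...)" spares the regex engine from recording the match of that group. The patterns below are hypothetical, as the real "DOC_URL_FILTER" is not shown in this log.

```java
import java.util.regex.Pattern;

class NonCapturingGroupSketch {
    // Hypothetical patterns, for illustration only.
    static final Pattern CAPTURING = Pattern.compile(".+\\.(pdf|docx?)$");
    static final Pattern NON_CAPTURING = Pattern.compile(".+\\.(?:pdf|docx?)$");

    // Both patterns accept the same strings; only the group bookkeeping differs.
    static boolean isDocUrl(String url) {
        return NON_CAPTURING.matcher(url).matches();
    }
}
```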
Lampros Smyrnaios
When at least one worker is still active and the service-shutdown has to wait for it, show a log-message to inform the user, including that worker's IP.
2024-01-15 13:35:22 +02:00
Lampros Smyrnaios
Prioritize the full-text urls over the landing-page ones.
2024-01-15 12:59:50 +02:00
Lampros Smyrnaios
- Avoid requesting further workerReport-batches when, already from the 1st batch, the base-directory of that assignments-counter is not found.
- Update dependencies.
2024-01-15 12:57:33 +02:00
Lampros Smyrnaios
- Allow easily changing the port used by workers.
- Show the number of active background-tasks and bulkImportDirs, which delay the Service's shutdown.
- Update dependencies.
- Code polishing.
2023-12-19 23:31:42 +02:00
Lampros Smyrnaios
Add the "shutdownAllWorkersGracefully" and "cancelShutdownAllWorkersGracefully" endpoints, in order to be able to shut them down at once and update them, without shutting down the whole Service. So in this case the bulk-import procedures will continue to work.
2023-11-29 16:45:58 +02:00
Lampros Smyrnaios
- Show the original exception thrown by the background-job, not the one thrown in the main-thread, which is useless, except for its message.
- Reduce the interval for deleting the unhandled assignments to once every 3 days.
- Set the upcoming version.
- Update dependencies.
2023-11-27 18:19:53 +02:00
Lampros Smyrnaios
- If we receive an "UnknownHostException" when uploading to the S3ObjectStore, then skip the current full-texts' batch to leave some time for the network to get unstuck.
- Code polishing.
2023-11-22 15:29:18 +02:00
Lampros Smyrnaios
Improve performance and reduce memory usage of the "findAssignmentsQuery":
- Reorder JOINs and predicates to reduce the computational cost.
- Remove the memory-costly "pu.url" predicates from the "where" clause, as the DB has no empty urls anymore.
2023-10-31 15:59:48 +02:00
Lampros Smyrnaios
- Add a scheduled job to delete assignments older than 7 days. These may be left behind when the worker throws a "SocketTimeoutException" before it can receive the assignments and process them. No workerReport gets created for those assignments.
- Improve some log-messages.
- Code polishing.
2023-10-30 12:29:54 +02:00
Lampros Smyrnaios
- Make sure the "UTF_8" charset is used, when we get a message from the response-body.
- Improve some log-messages.
2023-10-26 11:44:23 +03:00
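Note on the "UTF_8" change above: decoding bytes without naming a charset falls back to the platform default, which can garble non-ASCII messages on some hosts. The following is a minimal sketch of reading a response-body stream with an explicit charset; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ResponseBodySketch {
    static String readBody(InputStream body) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = body.read(chunk)) != -1)
                buffer.write(chunk, 0, n);
            // Decode explicitly as UTF-8, instead of relying on the platform's default charset.
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```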
Lampros Smyrnaios
- Refactor the "shutdown" script to do an orderly-shutdown, by default, by calling the "shutdownService" endpoint. In case a "force-shutdown" is needed, that can be requested with a cmd-argument.
- Fix not updating the "UrlsController.numOfWorkers" correctly.
- Code polishing.
2023-10-23 17:19:29 +03:00
Lampros Smyrnaios
- Improve performance in "FileUtils.addUrlReportsByMatchingRecordsFromBacklog()".
- Make sure we remove the assignments of all "not-successful", old, worker-reports, even for the ones which failed to be renamed to indicate success or failure, or failed to be executed by the background threads (and thus never reached the renaming stage).
2023-10-23 12:21:42 +03:00
Lampros Smyrnaios
- Improve the "getDataForPayloadPrefillQuery".
- Improve some error-messages.
2023-10-21 11:31:31 +03:00
Lampros Smyrnaios
- Fix the "IndexOutOfBoundsException", when checking the futures' results.
- Update dependencies.
2023-10-20 14:25:05 +03:00
Lampros Smyrnaios
- Handle the case when the "webHDFSBaseUrl" does not use HTTPS.
- Improve error-reporting when uploading a file to HDFS.
2023-10-19 11:59:37 +03:00
Lampros Smyrnaios
Move similar code into the new "ParquetFileUtils.getPayloadParquetRecord()" method.
2023-10-17 12:50:51 +03:00
Lampros Smyrnaios
Improve the names of some methods.
2023-10-16 23:39:43 +03:00