Commit Graph

313 Commits

Author SHA1 Message Date
Lampros Smyrnaios 69183a2e96 Make the S3-file-uploading operation more resilient to connection-issues, by increasing the timeout-periods and implementing retry-logic upon a timeout-error. 2024-11-09 18:05:25 +02:00
Lampros Smyrnaios d1a4f07b0a Add Bulk-Import support for "wileymlImport" provenance. 2024-11-09 14:19:45 +02:00
Lampros Smyrnaios 7f8eebb564 - Use the provided contentType per bulkImport-provenance or per workerAggregated-file, when available, instead of examining the file-extension of each file.
- Set next version.
2024-11-09 14:18:03 +02:00
Lampros Smyrnaios f8050c4165 Set new version. 2024-10-30 20:31:01 +02:00
Lampros Smyrnaios 86fcd3f652 - Update docker configuration.
- Update jar-file creation.
- Comment-out the unused "spring-boot-devtools" which may be causing class reloading.
- Update dependencies.
2024-10-30 20:21:15 +02:00
Lampros Smyrnaios ac1bd6ba65 Fix stable versions not recognized in "Dockerfile". 2024-10-22 23:51:54 +03:00
Lampros Smyrnaios a9d64a2898 - Set new version.
- Update dependencies.
2024-10-22 21:25:36 +03:00
Lampros Smyrnaios 07cca846bb - Add support for EuropePMC Bulk-Import.
- Small code-optimization.
- Add error-handling.
2024-10-22 19:55:22 +03:00
Lampros Smyrnaios f279f502cd Update dependencies. 2024-10-15 00:57:12 +03:00
Lampros Smyrnaios b2fcde84e8 Merge pull request 'Don't base the use of id mappings on hardcoded provenance' (#3) from michal.politowski/UrlsController:no-hardcoded-provenance into master
Reviewed-on: #3
2024-10-09 19:25:50 +02:00
Lampros Smyrnaios 3f1e96e9f3 Update README.md 2024-09-19 13:02:11 +02:00
Michał Politowski be43e73e5d Don't base the use of id mappings on hardcoded provenance
There will be more sources that need id mappings.
Just use them if defined.
2024-09-18 09:52:49 +02:00
Lampros Smyrnaios b8b83e3d74 Fix not allowing file-paths to have dots "." or parentheses "(", ")" in the directories-part of the path. 2024-07-04 01:42:23 +03:00
Lampros Smyrnaios 7a8270c69f - Perform manual synchronization on "BulkImportReport.eventsMultimap", in order to avoid the "ConcurrentModificationException" when requesting a BulkImport-report.
- Prepare app-version for next release.
2024-07-04 01:35:31 +03:00
Lampros Smyrnaios 9e9f417f1f - Remove the unused "accessmode" column from the results returned by the "findAssignmentsQuery".
- Update dependencies.
- Code polishing.
2024-06-27 23:10:46 +03:00
Lampros Smyrnaios e46743bfba Improve speed of fulltext-collection by using a ranking system to prioritize Open and Unknown access publications, over Restricted, Embargoed and Closed access ones. 2024-06-26 02:18:34 +03:00
Lampros Smyrnaios 63cf63e6cc - Add a warn-log for duplicate files inside a file-segment, when bulk-importing.
- Add error-handling in "ScheduledTasks.extractAssignmentsCounterAndDeleteRelatedAssignmentRecords()".
- Improve an error-message.
2024-06-17 13:16:38 +03:00
Lampros Smyrnaios c45a172c21 - Add/improve log-messages.
- Set new version.
- Update dependencies.
- Code polishing.
2024-06-17 12:26:42 +03:00
Lampros Smyrnaios ab18ac5ff8 Add new prometheus metrics:
- averageFulltextsTransferSizeOfWorkerReports
- averageSuccessPercentageOfWorkerReports
2024-06-14 17:27:52 +03:00
Lampros Smyrnaios 7e7fc35d1e Add support for Springer-bulkImport. 2024-06-14 15:39:44 +03:00
Lampros Smyrnaios 0d63165b6d - Add checks to verify that there are active workers in the Service, before proceeding to try posting "(cancel)Shutdown" requests to all known workers.
- Add documentation in README.
2024-06-14 13:32:44 +03:00
Lampros Smyrnaios 3417e5c68c Add handling for "null" values in "BulkImport.BulkImportSource".
When an "application.yml" property is set to "null", its value is registered as "empty". So, specifically for "BulkImport", where we do not want empty values, only nulls or valid values, convert the empty values to null upon initialization.
2024-06-14 12:32:39 +03:00
Lampros Smyrnaios fc258e2e26 - Rename the "pdfUrlPrefix" config-field to "fulltextUrlPrefix", as it may point to different file-formats in the future.
- Code polishing.
2024-06-07 13:21:27 +03:00
Lampros Smyrnaios ed7bf09f9b - Replace all "json" usages, with "gson" ones, in order to remove the "org.json:json" dependency.
- Add an extra check to verify that the remote parquet directories are directories indeed.
- Set new version.
2024-06-06 14:40:39 +03:00
Lampros Smyrnaios 9610b77b2b Update "gradlew" script. 2024-06-03 13:15:13 +03:00
Lampros Smyrnaios 643e497826 - Fix a regression, in which the BulkImportReport was not returning the events in a time-ordered state.
- Update dependencies.
2024-06-03 13:07:17 +03:00
Lampros Smyrnaios a48f07f41d - Increase the "-Xmx" java argument to 6Gb.
- Set the version to 2.7.3.
2024-05-31 21:43:56 +03:00
Lampros Smyrnaios 2241a89452 - Limit the depth of subdirectories to process in BulkImport.
- Code polishing.
2024-05-31 21:36:03 +03:00
Lampros Smyrnaios edf064616a Improve error-handling in "BulkImportReport.getJsonReport()" and "FileUtils.writeToFile()". 2024-05-30 11:52:04 +03:00
Lampros Smyrnaios d7697ef3f8 - Tighten the thread-safety protection on the "BulkImportReport.getJsonReport()" method.
- Update dependencies.
- Code polishing.
2024-05-27 10:40:05 +03:00
Lampros Smyrnaios b6ad2af48b - Reduce occupying space at any given time, by deleting the archives right after decompression and files-extraction.
- Code refactoring.
2024-05-22 12:11:39 +03:00
Lampros Smyrnaios e2e7ca72d5 - Fix not allowing the user to use the "shutdownAllWorkersGracefully" endpoint twice.
- Code optimization.
- Update dependencies.
2024-05-21 23:43:49 +03:00
Lampros Smyrnaios 39c36f9e66 - Resolve a concurrency issue, by enforcing synchronization on the "BulkImportReport.getJsonReport()" method.
- Increase the number of stacktrace-lines to 20, for bulkImport-segment-failures.
- Improve "GenericUtils.getSelectiveStackTrace()".
2024-05-01 01:29:25 +03:00
Lampros Smyrnaios 8e14d4dbe0 - Fix not counting the files from the bulkImport-segment, which failed due to an exception.
- Write segment-exception-messages to the bulkImport-report.
2024-04-30 23:43:52 +03:00
Lampros Smyrnaios 0d117743c2 - Code-optimization.
- Upload the updated gradle-wrapper and use the latest Gradle version in the "installAndRun.sh" script.
2024-04-30 02:13:08 +03:00
Lampros Smyrnaios 64a1b7d4f0 - Comment-out the bucket-deletion process, in order to avoid any accidental deletion, even if it has to be explicitly allowed in the config.
- Update dependencies.
2024-04-26 12:54:00 +03:00
Lampros Smyrnaios e2d43a9af0 Upgrade the "processBulkImportedFilesSegment" code:
1) Pre-calculate the file-hashes for all files of the segment and perform a single "getHashLocationsQuery", instead of thousands
2) Write some important events to the bulkImportReport file, as soon as they are added in the list.
2024-04-02 14:35:19 +03:00
Lampros Smyrnaios bd323ad69a Avoid a very rare case, where we might get an "IllegalArgumentException" from "Lists.partition()", in case the "sizeOfUrlReports" is <= 3. 2024-03-29 18:12:52 +02:00
Lampros Smyrnaios 08de530f03 Various improvements:
- Handle the case when "fileUtils.constructS3FilenameAndUploadToS3()" returns "null", in "processBulkImportedFile()".
- Avoid an "IllegalArgumentException" in "Lists.partition()" when the number of files to bulkImport is smaller than the number of threads available to handle them.
- Include the last directory's "/" divider in the fileDIR group of "FILEPATH_ID_EXTENSION" regex (renamed from "FILENAME_ID_EXTENSION").
- Fix an incomplete log-message.
- Provide the "fileLocation" argument in the "DocFileData" constructor, in "processBulkImportedFile()", even though it's not used after.
2024-03-29 17:23:01 +02:00
Lampros Smyrnaios 1d821ed803 - Prepare version for next release.
- Fix typo of not using the "OpenAireID" in the S3 location of bulkImported files. Instead, the "fileNameID" was used, which in aggregation is the OpenAireID, but not in bulk-import.
- Update dependencies.
- Code polishing.
2024-03-28 06:09:28 +02:00
Lampros Smyrnaios 8bc5cc35e2 - Optimize writing to the Bulk-import-report file.
- Show the IP of the worker which posts a "workerShutdownReport".
- Code polishing.
2024-03-22 17:50:55 +02:00
Lampros Smyrnaios b9b29dd51c Move some code from "FileUtils.getAndUploadFullTexts()" to two separate methods. 2024-03-20 16:53:03 +02:00
Lampros Smyrnaios 56d233d38e - Move the "FileUtils.mergeParquetFiles()" method to "ParquetFileUtils.mergeParquetFilesOfTable()".
- Fix a typo.
2024-03-20 15:25:19 +02:00
Lampros Smyrnaios 724eae1514 - Optimize the placement of "DatabaseConnector.databaseLock.unlock()" statements.
- Rename a maven-repository.
2024-03-20 15:08:01 +02:00
Lampros Smyrnaios 785204419d Update/cleanup the repositories in "build.gradle". 2024-03-15 12:22:13 +02:00
Lampros Smyrnaios 9b0818b535 - Add handling for additional/specific exceptions, when checking the "futures".
- Move common "ExecutionException" handling-code into its own method: "GenericUtils.getSelectedStackTraceForCausedException()".
- Avoid a double log.
- Code polishing.
2024-03-14 13:59:23 +02:00
Lampros Smyrnaios b34417dc45 Optimize the test-DB creation process:
- Use views of the "initialDatabase" view and tables to a) reduce the amount of space used by test-DBs and b) improve test-db creation performance.
- Avoid possible failures from outdated metadata.
2024-03-14 13:10:54 +02:00
Lampros Smyrnaios f61cae41a1 - Try to get the cause of the exception of the callable-tasks which handle the bulk-import of fileSegments.
- Fix not counting the failedSegments when an exception was thrown.
- Code polishing.
2024-03-13 12:15:59 +02:00
Lampros Smyrnaios 8f9786de09 Upgrade the algorithm for finding the previously-found fulltexts, based on their md5hash:
- Use a single query with a list of the fileHashes, instead of thousands of single-md5hash-check queries (run at most 6 in parallel) which require a lot of I/O.
- Avoid checking the same fileHash multiple times, in case it is related to multiple payloads.
- In case of a database-error, avoid completely losing the full-texts of that worker; instead, continue processing the full-texts.
2024-03-13 11:28:37 +02:00
Lampros Smyrnaios e4540e7f3c Handle the case when a urlReports-sublist does not have any payloads inside. 2024-03-12 14:25:00 +02:00