Commit Graph

  • 0673043864 - Improve text-layout in README. - Improve error-handling. - Update Gradle and optimize its properties. - Set next version. - Code polishing. master Lampros Smyrnaios 2024-12-19 23:00:05 +0200
  • 2cc1c3574a Set new version. 2.8.4 Lampros Smyrnaios 2024-11-11 19:29:47 +0200
  • bd5802f1d2 - Try to fix value-reset of "shouldShutdownService"-variable, after setting it to "true", by a) changing its access to be done through Singleton Bean (a @Service class) and b) using synchronization. - Update Gradle and optimize its settings. Lampros Smyrnaios 2024-11-11 19:27:37 +0200
  • 69183a2e96 Make the S3-file-uploading operation more resilient to connection-issues, by increasing timeout-periods and implementing a retry-logic upon timeout-error. Lampros Smyrnaios 2024-11-09 18:05:25 +0200
  • d1a4f07b0a Add Bulk-Import support for "wileymlImport" provenance. Lampros Smyrnaios 2024-11-09 14:19:45 +0200
  • 7f8eebb564 - Use the provided contentType per bulkImport-provenance or per workerAggregated-file, when available, instead of examining the file-extension of each file. - Set next version. Lampros Smyrnaios 2024-11-09 14:18:03 +0200
  • f8050c4165 Set new version. 2.8.3 Lampros Smyrnaios 2024-10-30 20:31:01 +0200
  • 86fcd3f652 - Update docker configuration. - Update jar-file creation. - Comment-out the unused "spring-boot-devtools" which may be causing class reloading. - Update dependencies. Lampros Smyrnaios 2024-10-30 20:21:15 +0200
  • ac1bd6ba65 Fix stable versions not recognized in "Dockerfile". 2.8.2 Lampros Smyrnaios 2024-10-22 23:51:54 +0300
  • a9d64a2898 - Set new version. - Update dependencies. Lampros Smyrnaios 2024-10-22 21:25:36 +0300
  • 07cca846bb - Add support for EuropePMC Bulk-Import. - Small code-optimization. - Add error-handling. Lampros Smyrnaios 2024-10-22 19:55:22 +0300
  • f279f502cd Update dependencies. 2.8.1 Lampros Smyrnaios 2024-10-15 00:57:12 +0300
  • b2fcde84e8 Merge pull request 'Don't base the use of id mappings on hardcoded provenance' (#3) from michal.politowski/UrlsController:no-hardcoded-provenance into master Lampros Smyrnaios 2024-10-09 19:25:50 +0200
  • 3f1e96e9f3 Update README.md Lampros Smyrnaios 2024-09-19 13:02:11 +0200
  • be43e73e5d Don't base the use of id mappings on hardcoded provenance #3 Michał Politowski 2024-09-18 09:52:49 +0200
  • b8b83e3d74 Fix not allowing file-paths to have dots "." or parentheses "(", ")" in the directories-part of the path. Lampros Smyrnaios 2024-07-04 01:42:23 +0300
  • 7a8270c69f - Perform manual synchronization on "BulkImportReport.eventsMultimap", in order to avoid the "ConcurrentModificationException" when requesting a BulkImport-report. - Prepare app-version for next release. Lampros Smyrnaios 2024-07-04 01:35:31 +0300
  • 9e9f417f1f - Remove the unused "accessmode" column from the results returned by the "findAssignmentsQuery". - Update dependencies. - Code polishing. 2.8.0 Lampros Smyrnaios 2024-06-27 23:10:46 +0300
  • e46743bfba Improve speed of fulltext-collection by using a ranking system to prioritize Open and Unknown access publications, over Restricted, Embargoed and Closed access ones. Lampros Smyrnaios 2024-06-26 02:18:34 +0300
  • 63cf63e6cc - Add a warn-log for duplicate files inside a file-segment, when bulk-importing. - Add error-handling in "ScheduledTasks.extractAssignmentsCounterAndDeleteRelatedAssignmentRecords()". - Improve an error-message. Lampros Smyrnaios 2024-06-17 13:16:38 +0300
  • c45a172c21 - Add/improve log-messages. - Set new version. - Update dependencies. - Code polishing. Lampros Smyrnaios 2024-06-17 12:26:42 +0300
  • ab18ac5ff8 Add new prometheus metrics: - averageFulltextsTransferSizeOfWorkerReports - averageSuccessPercentageOfWorkerReports Lampros Smyrnaios 2024-06-14 17:27:52 +0300
  • 7e7fc35d1e Add support for Springer-bulkImport. Lampros Smyrnaios 2024-06-14 15:39:44 +0300
  • 0d63165b6d - Add checks to verify that there are active workers in the Service, before proceeding to try posting "(cancel)Shutdown" requests to all known workers. - Add documentation in README. Lampros Smyrnaios 2024-06-14 13:32:44 +0300
  • 3417e5c68c Add handling for "null" values in "BulkImport.BulkImportSource". When setting an "application.yml" property to "null", then its value is registered as "empty". So, specifically for "BulkImport", where we do not want empty values, only nulls or valid values, convert the empty values to null, upon initialization. Lampros Smyrnaios 2024-06-14 12:32:39 +0300
  • fc258e2e26 - Rename the "pdfUrlPrefix" config-field to "fulltextUrlPrefix", as it may point to different file-formats in the future. - Code polishing. Lampros Smyrnaios 2024-06-07 13:21:27 +0300
  • ed7bf09f9b - Replace all "json" usages, with "gson" ones, in order to remove the "org.json:json" dependency. - Add an extra check to verify that the remote parquet directories are directories indeed. - Set new version. Lampros Smyrnaios 2024-06-06 14:40:39 +0300
  • 9610b77b2b Update "gradlew" script. 2.7.3 Lampros Smyrnaios 2024-06-03 13:15:13 +0300
  • 643e497826 - Fix a regression, in which the BulkImportReport was not returning the events in a time-ordered state. - Update dependencies. Lampros Smyrnaios 2024-06-03 13:07:17 +0300
  • a48f07f41d - Increase the "-Xmx" java argument to 6Gb. - Set the version to 2.7.3. Lampros Smyrnaios 2024-05-31 21:43:56 +0300
  • 2241a89452 - Limit the depth of subdirectories to process in BulkImport. - Code polishing. Lampros Smyrnaios 2024-05-31 21:36:03 +0300
  • edf064616a Improve error-handling in "BulkImportReport.getJsonReport()" and "FileUtils.writeToFile()". Lampros Smyrnaios 2024-05-30 11:52:04 +0300
  • d7697ef3f8 - Tighten the thread-safety protection on the "BulkImportReport.getJsonReport()" method. - Update dependencies. - Code polishing. 2.7.2 Lampros Smyrnaios 2024-05-27 10:40:05 +0300
  • b6ad2af48b - Reduce occupying space at any given time, by deleting the archives right after decompression and files-extraction. - Code refactoring. 2.7.1 Lampros Smyrnaios 2024-05-22 12:11:39 +0300
  • e2e7ca72d5 - Fix not allowing the user to use the "shutdownAllWorkersGracefully" endpoint twice. - Code optimization. - Update dependencies. Lampros Smyrnaios 2024-05-21 23:43:49 +0300
  • 39c36f9e66 - Resolve a concurrency issue, by enforcing synchronization on the "BulkImportReport.getJsonReport()" method. - Increase the number of stacktrace-lines to 20, for bulkImport-segment-failures. - Improve "GenericUtils.getSelectiveStackTrace()". Lampros Smyrnaios 2024-05-01 01:29:25 +0300
  • 8e14d4dbe0 - Fix not counting the files from the bulkImport-segment, which failed due to an exception. - Write segment-exception-messages to the bulkImport-report. Lampros Smyrnaios 2024-04-30 23:43:52 +0300
  • 0d117743c2 - Code-optimization. - Upload the updated gradle-wrapper and set using the latest Gradle version in "installAndRun.sh" script. Lampros Smyrnaios 2024-04-30 02:13:08 +0300
  • 64a1b7d4f0 - Comment-out the bucket-deletion process, in order to avoid any accidental deletion, even if it has to be explicitly allowed in the config. - Update dependencies. 2.7.0 Lampros Smyrnaios 2024-04-26 12:54:00 +0300
  • e2d43a9af0 Upgrade the "processBulkImportedFilesSegment" code: 1) Pre-calculate the file-hashes for all files of the segment and perform a single "getHashLocationsQuery", instead of thousands 2) Write some important events to the bulkImportReport file, as soon as they are added in the list. Lampros Smyrnaios 2024-04-02 14:35:19 +0300
  • bd323ad69a Avoid a very rare case, where we might get an "IllegalArgumentException" from "Lists.partition()", in case the "sizeOfUrlReports" is <= 3. Lampros Smyrnaios 2024-03-29 18:12:52 +0200
  • 08de530f03 Various improvements: - Handle the case when "fileUtils.constructS3FilenameAndUploadToS3()" returns "null", in "processBulkImportedFile()". - Avoid an "IllegalArgumentException" in "Lists.partition()" when the number of files to bulkImport is smaller than the number of threads available to handle them. - Include the last directory's "/" divider in the fileDIR group of "FILEPATH_ID_EXTENSION" regex (renamed from "FILENAME_ID_EXTENSION"). - Fix an incomplete log-message. - Provide the "fileLocation" argument in the "DocFileData" constructor, in "processBulkImportedFile()", even though it's not used after. Lampros Smyrnaios 2024-03-29 17:23:01 +0200
  • 1d821ed803 - Prepare version for next release. - Fix typo of not using the "OpenAireID" in the S3 location of bulkImported files. Instead, the "fileNameID" was used, which in aggregation is the OpenAireID, but not in bulk-import. - Update dependencies. - Code polishing. Lampros Smyrnaios 2024-03-28 06:09:28 +0200
  • 8bc5cc35e2 - Optimize writing to the Bulk-import-report file. - Show the IP of the worker which posts a "workerShutdownReport". - Code polishing. Lampros Smyrnaios 2024-03-22 17:50:55 +0200
  • b9b29dd51c Move some code from "FileUtils.getAndUploadFullTexts()" to two separate methods. Lampros Smyrnaios 2024-03-20 16:53:03 +0200
  • 56d233d38e - Move the "FileUtils.mergeParquetFiles()" method to "ParquetFileUtils.mergeParquetFilesOfTable()". - Fix a typo. Lampros Smyrnaios 2024-03-20 15:25:19 +0200
  • 724eae1514 - Optimize the placement of "DatabaseConnector.databaseLock.unlock()" statements. - Rename a maven-repository. Lampros Smyrnaios 2024-03-20 15:08:01 +0200
  • 785204419d Update/cleanup the repositories in "build.gradle". Lampros Smyrnaios 2024-03-15 12:22:13 +0200
  • 9b0818b535 - Add handling for additional/specific exceptions, when checking the "futures". - Move common "ExecutionException" handling-code into its own method: "GenericUtils.getSelectedStackTraceForCausedException()". - Avoid a double log. - Code polishing. Lampros Smyrnaios 2024-03-14 13:59:23 +0200
  • b34417dc45 Optimize the test-DB creation process: - Use views of the "initialDatabase" view and tables to a) reduce the amount of space used by test-DBs and b) improve test-db creation performance. - Avoid possible failures from outdated metadata. Lampros Smyrnaios 2024-03-14 13:10:54 +0200
  • f61cae41a1 - Try to get the cause of the exception of the callable-tasks which handle the bulk-import of fileSegments. - Fix not counting the failedSegments when an exception was thrown. - Code polishing. Lampros Smyrnaios 2024-03-13 12:15:59 +0200
  • 8f9786de09 Upgrade the algorithm for finding the previously-found fulltexts, based on their md5hash: - Use a single query with a list of the fileHashes, instead of thousands of single-md5hash-check queries (run at most 6 in parallel) which require a lot of I/O. - Avoid checking multiple times the same fileHash, in case it is related with multiple payloads. - In case of a database-error, avoid completely losing the full-texts of that worker, instead, continue processing the full-texts. Lampros Smyrnaios 2024-03-13 11:28:37 +0200
  • e4540e7f3c Handle the case when a urlReports-sublist does not have any payloads inside. Lampros Smyrnaios 2024-03-12 14:25:00 +0200
  • e20c5d2146 - Add error-handling for the case when no payloads could be associated with a specific url which should have been in the hashMultiMap in "addUrlReportsByMatchingRecordsFromBacklog". - Fix not cloning the payload, before changing it and adding it in the "prefilledPayloads"-list; instead, an object-reference was used. Lampros Smyrnaios 2024-03-11 19:48:04 +0200
  • 1048463ca0 - Improve error-handling in "S3ObjectStore.emptyBucket()". - Change some log-levels. - Code polishing. Lampros Smyrnaios 2024-03-11 16:17:32 +0200
  • 8f18008001 Avoid performing payload-related operations in case no fulltext was received from the worker, due to an error. Lampros Smyrnaios 2024-03-11 14:57:13 +0200
  • ce3e149a95 Improve the "emptying/deleting" process of the S3-bucket. Lampros Smyrnaios 2024-03-11 13:34:38 +0200
  • dd394f18a0 - Optimize the JOIN-order in the "findAssignmentsQuery". - Optimize the "DOC_URL_FILTER"-regex. - Update dependencies. Lampros Smyrnaios 2024-03-11 11:35:38 +0200
  • 43ea64758d - Improve handling of the case when no fulltexts have been found or none of the found ones were requested from the worker, as they were already retrieved in the past. - Show the number of files with problematic locations (if any of them exist). - Code polishing. Lampros Smyrnaios 2024-02-23 12:39:28 +0200
  • 749172edd8 Add the Jenkins' build-status badge in README. Lampros Smyrnaios 2024-02-08 19:49:58 +0200
  • b72996c9a9 - Configure the destination of the logs in the "application.properties" file. - Add some gradle files to be used by Jenkins. Lampros Smyrnaios 2024-02-08 19:47:34 +0200
  • 3563fd6e2a - Try to get the cause of the exception of the callable-tasks which handle parquet-files. - Update License. - Update dependencies. Lampros Smyrnaios 2024-02-07 18:34:28 +0200
  • 34d7a143e7 Add/improve documentation. Lampros Smyrnaios 2024-02-01 14:37:29 +0200
  • 5dadb8ad2f - Optimize the "DOC_URL_FILTER"-regex, by using a non-capturing group. - Remove an extra "File.separator" from the fulltexts-fullFilePath. Lampros Smyrnaios 2024-01-19 15:46:23 +0200
  • bdc61c2cda When at least one worker is still active and have to wait for service-shutdown, show a log-message to inform the user, including that worker's IP. 2.6.2 Lampros Smyrnaios 2024-01-15 13:35:22 +0200
  • 3a70b57146 Prioritize the full-text urls over the landing-page ones. Lampros Smyrnaios 2024-01-15 12:59:50 +0200
  • ee1ca8966b - Avoid continuing to request workerReport-batches when from the 1st batch, the base-directory of that assignments-counter is not found. - Update dependencies. Lampros Smyrnaios 2024-01-15 12:57:33 +0200
  • 2e60128084 - Allow to easily change the port used by workers. - Show the number of active background-tasks and bulkImportDirs, which delay the Service's shutdown. - Update dependencies. - Code polishing. 2.6.1 Lampros Smyrnaios 2023-12-19 23:31:42 +0200
  • d90ad51609 Add the "shutdownAllWorkersGracefully" and "cancelShutdownAllWorkersGracefully" endpoints, in order to be able to shut them down at once and update them, without shutting down the whole Service. So in this case the bulk-import procedures will continue to work. Lampros Smyrnaios 2023-11-29 16:45:58 +0200
  • d20c9a7d2e - Show the original exception thrown by the background-job, not the one thrown in the main-thread, which is useless, except for its message. - Reduce the interval for deleting the unhandled assignments to once every 3 days. - Set the upcoming version. - Update dependencies. Lampros Smyrnaios 2023-11-27 18:19:53 +0200
  • 7f789b8ad0 - If we receive an "UnknownHostException" when uploading to the S3ObjectStore, then skip the current full-texts' batch to leave some time for the network to get unstuck. - Code polishing. Lampros Smyrnaios 2023-11-22 15:29:18 +0200
  • 9b1f2c4931 Improve performance and reduce memory usage of the "findAssignmentsQuery": - Reorder JOINs and predicates to reduce the computational cost. - Remove the memory-costly "pu.url" predicates from the "where" clause, as the DB has no empty urls anymore. Lampros Smyrnaios 2023-10-31 15:59:48 +0200
  • db929d8931 - Add a scheduling job to delete assignments older than 7 days. These may be left behind when the worker throws a "SocketTimeoutException" before it can receive the assignments and process them. No workerReport gets created for those assignments. - Improve some log-messages. - Code polishing. Lampros Smyrnaios 2023-10-30 12:29:54 +0200
  • 856c62887d - Make sure the "UTF_8" charset is used, when we get a message from the response-body. - Improve some log-messages. Lampros Smyrnaios 2023-10-26 11:44:23 +0300
  • bdf834c439 - Refactor the "shutdown" script to do an orderly-shutdown, by default, by calling the "shutdownService" endpoint. In case a "force-shutdown" is needed, that can be requested with a cmd-argument. - Fix not updating the "UrlsController.numOfWorkers" correctly. - Code polishing. Lampros Smyrnaios 2023-10-23 17:19:29 +0300
  • 0c7bf6357b - Improve performance in "FileUtils.addUrlReportsByMatchingRecordsFromBacklog()". - Make sure we remove the assignments of all "not-successful", old, worker-reports, even for the ones which failed to be renamed to indicate success or failure, or failed to be executed by the background threads (and thus never reached the renaming stage). 2.6 Lampros Smyrnaios 2023-10-23 12:21:42 +0300
  • a7581335f1 - Improve the "getDataForPayloadPrefillQuery". - Improve some error-messages. Lampros Smyrnaios 2023-10-21 11:31:31 +0300
  • 44c2fe7418 - Fix the "IndexOutOfBoundsException", when checking the futures' results. - Update dependencies. Lampros Smyrnaios 2023-10-20 14:25:05 +0300
  • df0ea62a5a - Handle the case when the "webHDFSBaseUrl" does not use HTTPS. - Improve error-reporting when uploading a file to HDFS. Lampros Smyrnaios 2023-10-19 11:59:37 +0300
  • 40729c6295 Move similar code into the new "ParquetFileUtils.getPayloadParquetRecord()" method. Lampros Smyrnaios 2023-10-17 12:50:51 +0300
  • f05eee7569 Improve the names of some methods. Lampros Smyrnaios 2023-10-16 23:39:43 +0300
  • def21b991d Improve the UX of the "installAndRun.sh" script. Lampros Smyrnaios 2023-10-09 17:28:22 +0300
  • fb2877dbe8 Upgrade the execution system for the backgroundTasks: - Submit each task immediately for execution, instead of waiting for a scheduling thread to send all gathered tasks (up to that point) to the ExecutorService (and block until they are finished, before it can start again). - Hold the Future of each submitted task in a synchronized-list to check the result of each task at a scheduled time. - Reduce the cpu-time to assure the Service can shut down, by checking if there are "actively" and "about-to-be-executed" tasks, at the same time. Instead of having to rely on the additional checking of the "shutdown"-status of each worker to verify that no active tasks exist. - Improve the threads' shutdown procedure. Lampros Smyrnaios 2023-10-09 17:23:59 +0300
  • a354da763d - Improve some log-messages. - Increase app's version. - Code polishing. Lampros Smyrnaios 2023-10-06 17:28:54 +0300
  • 0c79fdea35 Update the "findAssignmentsQuery" to check the "attempt.error_class" field for the current pub_url, not the pub_id. Lampros Smyrnaios 2023-10-06 14:59:26 +0300
  • ebf8896005 - Fix getter and setter methods for the "isAuthoritative" field. - Update Gradle. Lampros Smyrnaios 2023-10-05 16:31:52 +0300
  • b2ce6393c1 - Add check for remaining "bulkImportDirsUnderProcessing", before shutting down the Service. - Code polishing. Lampros Smyrnaios 2023-10-05 13:43:47 +0300
  • 96c11ba4b8 - Add a missing change. - Code optimization and polishing. - Update dependencies. Lampros Smyrnaios 2023-10-04 16:17:12 +0300
  • 7019f7c3c7 Improve aggregation speed, by generating additional "attempt" and "payload" records for the publications which are in the back-log and their url matches to one of the urls of the current payloads. Lampros Smyrnaios 2023-10-04 15:43:31 +0300
  • b702cf4484 Upgrade the "findAssignmentsQuery": - Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allows us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), b) avoid aggregating urls which are already inside the "payload" or "assignment" tables, even when they are related with other IDs. In the end, we only care about the urls when choosing which records should be aggregated. - Improve performance by using the "anti join" operator, where it fits, in order to allow the engine to use the faster "hash" operations. Lampros Smyrnaios 2023-10-04 13:43:15 +0300
  • c9626de120 Handle the case when the "upload-file-to-S3" operation fails with a "ConnectException". In this case, all remaining upload operations for the files of that particular batch or segment, are canceled. Lampros Smyrnaios 2023-10-04 13:01:13 +0300
  • 865926fbc3 - Handle the case when some results have been found from the "getAssignmentsQuery", but no data could be extracted from them. - Code polishing. Lampros Smyrnaios 2023-10-02 15:46:55 +0300
  • ede7ca5a89 - Add bulk-import support for non-Authoritative data-sources. - Update Spring Boot. - Code polishing. Lampros Smyrnaios 2023-09-26 18:01:55 +0300
  • 90a864ea61 Add more info in bulk-import logs. Lampros Smyrnaios 2023-09-20 17:50:10 +0300
  • 0f5d4dac78 Check and show warning/error message for failed payloads. Lampros Smyrnaios 2023-09-20 17:38:22 +0300
  • 068b97dd60 Set Xms and Xmx Java-parameters when running the Jar, in Docker. 2.5 Lampros Smyrnaios 2023-09-15 14:19:46 +0300
  • 903c3e1ffc Add thread-safety when reading the bulkImportReport-files. Lampros Smyrnaios 2023-09-15 11:54:32 +0300
  • 846c53913f Add LICENSE. Lampros Smyrnaios 2023-09-14 16:05:36 +0300
  • 360731ba72 - Improve handling of the "NO_CONTENT" case, in "getAssignments"-endpoint. - Code optimization and polishing. Lampros Smyrnaios 2023-09-14 13:53:01 +0300
  • b4f91f188e Fix the "retries-num" appearing in log-messages. Lampros Smyrnaios 2023-09-14 12:08:33 +0300