Digital Maktaba Tool

Author: DataRiver S.r.l., Modena, Italy. Date: 31/10/2025
License: GNU General Public License v3.0


Introduction

This repository hosts a digital library for documents written in non-Latin scripts, developed within ITSERR WP5 (Digital Maktaba). It streamlines the end-to-end workflow for librarians and researchers: upload → OCR (Arabic-centric) → metadata extraction → semi-automatic cataloguing → search/browse.

Features

  • Arabic OCR pipeline (CPU): batch processing of PDFs and images with post-processing helpers.
  • Metadata extraction & suggestions: front-matter cues and keyword-based category suggestions.
  • Cataloguing UI: librarian-friendly editing/validation of records.
  • Search & browse: Library page with full-text search and filters, plus a document detail view.
  • Roles & permissions: Librarian (ingest/catalogue) and Researcher (search/consult).

Prerequisites

  • Runtime (dev): Python ≥ 3.10 (venv recommended)
  • Deploy: Docker + Docker Compose
  • Auth: Keycloak (via D4Science, OIDC client configured)
  • Database: PostgreSQL (containerized via Compose or external)

Core Python libraries (see requirements.txt for the full list): NiceGUI, EasyOCR (PyTorch, torchvision), Pillow, PyMuPDF/fitz, pdf2image, NLTK, langid, Tashaphyne + arramooz-pysqlite (Arabic NLP), APScheduler, psycopg2-binary (PostgreSQL).

Installation

  1. Clone this repository:
    git clone https://code-repo.d4science.org/Resilience/wp5_digital_maktaba
    cd wp5_digital_maktaba
    
  2. Create a .env file in the project root (example values below):
    # App
    APP_BASE_URL=http://localhost:8080
    SECRET_KEY=change-me
    LOG_DIR=./logs
    STORAGE_DIR=./uploads
    
    # Database
    DATABASE_URL=postgresql+psycopg2://USERNAME:PASSWORD@db:5432/maktaba
    
    # Keycloak (D4Science)
    KEYCLOAK_SERVER_URL=https://<keycloak-host>/
    KEYCLOAK_REALM=<realm>
    KEYCLOAK_CLIENT_ID=<client_id>
    KEYCLOAK_CLIENT_SECRET=<client_secret>   # only for confidential clients
    KEYCLOAK_REDIRECT_URI=${APP_BASE_URL}/oidc/callback
    REQUIRED_ROLE=librarian
    
  3. Start the stack:
    docker compose up -d --build
    
  4. Open http://localhost:8080 and sign in via Keycloak (D4Science).

The OCR pipeline is designed for CPU; GPU is not required.
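
To catch a misconfigured .env early, the settings could be validated once at startup. The following is a minimal sketch under assumptions, not the project's actual loader: the `Settings` class, the `load_settings` function, the set of variables treated as mandatory, and the fallback values are all hypothetical; only the variable names come from the example above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: these variables are treated as mandatory; the real app may differ.
REQUIRED_VARS = (
    "APP_BASE_URL", "SECRET_KEY", "DATABASE_URL",
    "KEYCLOAK_SERVER_URL", "KEYCLOAK_REALM", "KEYCLOAK_CLIENT_ID",
)


@dataclass(frozen=True)
class Settings:
    app_base_url: str
    secret_key: str
    database_url: str
    keycloak_server_url: str
    keycloak_realm: str
    keycloak_client_id: str
    keycloak_client_secret: Optional[str]  # None for public clients
    keycloak_redirect_uri: str
    required_role: str
    log_dir: str
    storage_dir: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment and fail fast on missing values."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))

    base_url = env["APP_BASE_URL"].rstrip("/")
    return Settings(
        app_base_url=base_url,
        secret_key=env["SECRET_KEY"],
        database_url=env["DATABASE_URL"],
        keycloak_server_url=env["KEYCLOAK_SERVER_URL"],
        keycloak_realm=env["KEYCLOAK_REALM"],
        keycloak_client_id=env["KEYCLOAK_CLIENT_ID"],
        keycloak_client_secret=env.get("KEYCLOAK_CLIENT_SECRET") or None,
        # Falls back to the callback path used in the .env example.
        keycloak_redirect_uri=env.get("KEYCLOAK_REDIRECT_URI")
        or f"{base_url}/oidc/callback",
        required_role=env.get("REQUIRED_ROLE", "librarian"),
        log_dir=env.get("LOG_DIR", "./logs"),
        storage_dir=env.get("STORAGE_DIR", "./uploads"),
    )
```

Calling `load_settings()` with no argument reads from `os.environ`; passing a plain dictionary makes the function easy to exercise in tests.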

Usage

Input Files

Supported input formats for OCR and ingestion:

  • PDF, PNG, JPG/JPEG
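
As an illustration, an upload handler could reject unsupported files before they reach the OCR stage. This is a hedged sketch: `SUPPORTED_SUFFIXES`, `is_supported`, and `split_batch` are hypothetical names, and the check looks only at the file extension, not at the file content.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Extensions taken from the list of supported input formats.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is one of the supported formats."""
    return Path(filename).suffix.lower() in SUPPORTED_SUFFIXES


def split_batch(filenames: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition a batch into (accepted, rejected) file names."""
    accepted: List[str] = []
    rejected: List[str] = []
    for name in filenames:
        (accepted if is_supported(name) else rejected).append(name)
    return accepted, rejected
```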

Output

  • Processed records stored in PostgreSQL (documents, metadata, categories).
  • Browsable library on the Library page; search across full text and metadata.
  • Downloadable assets (when enabled) under STORAGE_DIR with structured subfolders.

Detailed Explanation

Architecture Overview

  • Web UI & API: NiceGUI app.
  • Processing pipeline: OCR + extraction tasks triggered on ingest; results stored in DB and surfaced to the UI.
  • AuthN/Z: Keycloak (OIDC) via D4Science.
  • Persistence: PostgreSQL database; uploads live on mounted storage volumes.
  • Containerization: Docker Compose to orchestrate app + DB (+ optional reverse proxy).
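
To make the AuthN/Z bullet above more concrete, here is a minimal sketch of a role check on an already verified and decoded access token. It assumes the standard Keycloak claim layout (`realm_access.roles` and `resource_access.<client_id>.roles`); the D4Science deployment may expose roles differently, and `extract_roles` and `has_required_role` are hypothetical helpers, not functions from this repository.

```python
from typing import Any, Mapping, Set


def extract_roles(claims: Mapping[str, Any], client_id: str) -> Set[str]:
    """Collect realm-level and client-level roles from decoded token claims."""
    roles = set(claims.get("realm_access", {}).get("roles", []))
    client_section = claims.get("resource_access", {}).get(client_id, {})
    roles.update(client_section.get("roles", []))
    return roles


def has_required_role(
    claims: Mapping[str, Any], client_id: str, required_role: str = "librarian"
) -> bool:
    """Return True if the token grants the required role (exact match)."""
    return required_role in extract_roles(claims, client_id)
```

The sketch deliberately leaves out signature and expiry validation, which must happen before any claim is trusted.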

Core Components

  • UI pages:
    homePage.py, uploadPage.py, cataloguePage.py, documentDetailPage.py, libraryPage.py, auth.py
  • Language tools: languageDetection.py and helpers (Arabic NLP with Tashaphyne/arramooz, langid, NLTK).
  • OCR: EasyOCR/PyTorch routine for PDF/images (with PyMuPDF/pdf2image/Pillow for I/O).
  • Background jobs: scheduling/async processing (APScheduler).
  • DB layer: PostgreSQL connection (psycopg2-binary); application models/tables for documents, metadata, categories, users/roles.
  • Logging: console + file logs for workers (scanner.*.log).
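
The console plus file logging mentioned above can be set up with the standard library alone. The sketch below is an assumption about how such a logger might be built, not the project's code: `build_worker_logger` and the per-worker file name `scanner.<worker_name>.log` are hypothetical, chosen to match the `scanner.*.log` pattern.

```python
import logging
from pathlib import Path


def build_worker_logger(worker_name: str, log_dir: str = "./logs") -> logging.Logger:
    """Return a logger that writes to the console and to a per-worker file."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"scanner.{worker_name}")
    logger.setLevel(logging.INFO)
    if logger.handlers:
        # Already configured: avoid attaching duplicate handlers.
        return logger

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    console = logging.StreamHandler()
    console.setFormatter(formatter)

    # Hypothetical file name following the scanner.*.log pattern.
    file_handler = logging.FileHandler(
        Path(log_dir) / f"scanner.{worker_name}.log", encoding="utf-8"
    )
    file_handler.setFormatter(formatter)

    logger.addHandler(console)
    logger.addHandler(file_handler)
    logger.propagate = False  # keep messages from being printed twice via the root logger
    return logger
```

A worker would typically call `build_worker_logger("ocr", log_dir)` once at startup, with `log_dir` taken from the `LOG_DIR` setting.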

Parsing Logic

Here “parsing logic” refers to the ingestion/OCR/metadata pipeline.

  1. Ingest & normalization
    • Accept PDFs/images; normalize pages.
  2. OCR (Arabic-centric)
    • Page-level OCR via EasyOCR (CPU) with text blocks extracted and cleaned.
  3. Language & metadata hints
    • Language detection; rule- or keyword-based extraction of front-matter cues.
  4. Category suggestion
    • Match user-defined keyword sets (e.g., static/categoryData.txt) to propose leaf categories.
  5. Record creation
    • Persist document, pages and extracted fields; mark items pending librarian validation.
  6. Catalogue & search
    • Librarians validate/enrich the record; researchers search/browse the library.
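
Step 4 (category suggestion) can be sketched as a keyword count over the OCR text. This is a simplified illustration under assumptions: the format of static/categoryData.txt is not described here, so the keyword sets are passed in as a dictionary; `tokenize` and `suggest_categories` are hypothetical names; and the category labels are made up for the example. Tokens are matched exactly, so without stemming or diacritic stripping (for instance via Tashaphyne) prefixed or vocalized word forms will not match.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens (Unicode-aware, so Arabic letters match)."""
    return re.findall(r"\w+", text.lower())


def suggest_categories(
    text: str, keyword_sets: Dict[str, Iterable[str]], top_n: int = 3
) -> List[str]:
    """Rank categories by how often their single-word keywords occur in the text."""
    counts = Counter(tokenize(text))
    scores = {}
    for category, keywords in keyword_sets.items():
        score = sum(counts[keyword.lower()] for keyword in keywords)
        if score:
            scores[category] = score
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


# Hypothetical keyword sets and sample text.
keyword_sets = {
    "Fiqh": ["الفقه", "فقه"],
    "Hadith": ["الحديث", "حديث"],
    "Poetry": ["شعر"],
}
suggestions = suggest_categories("كتاب في الفقه وأصول الفقه مع شرح حديث", keyword_sets)
```

In this sample, "Fiqh" is ranked before "Hadith" because its keywords occur twice in the text, and "Poetry" is not suggested at all. The suggestions remain proposals that a librarian confirms or corrects during validation.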

Contributing

Contributions are welcome.

  1. Fork the repository (or create a feature branch if working in the organization/Gitea).
  2. Create a branch:
    git checkout -b feature/<your-feature>
    
  3. Commit and push:
    git commit -m "Add <your-feature>"
    git push origin feature/<your-feature>
    
  4. Open a Pull Request. Please follow CONTRIBUTING.md (coding style, branching, testing, issue/PR templates).

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Version: 1.0
Release Date: 2025-10-31

Acknowledgements

This work was supported by the PNRR project Italian Strengthening of ESFRI RI RESILIENCE (ITSERR), funded by the European Union NextGenerationEU (CUP: B53C22001770006).

We warmly thank:

  • FINCONS Group, for their technical support and collaboration during the deployment and integration phases of the platform;
  • UNGUESS, for the initial UI/UX design sketches developed in Figma that inspired the current interface;
  • Giovanni Sullutrone and Luca Sala, for their development and testing efforts.

The views and opinions expressed are those of the authors and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them.


For more information, please visit the ITSERR project website: https://www.itserr.it