Digital Maktaba Tool
Author: DataRiver S.r.l., Modena, Italy.
Date: 31/10/2025
License: GNU General Public License v3.0
Table of Contents
- Introduction
- Features
- Prerequisites
- Installation
- Usage
- Detailed Explanation
- Contributing
- License
- Acknowledgements
Introduction
This repository hosts a digital library for documents written in non-Latin scripts, developed within ITSERR – WP5 (Digital Maktaba). It streamlines the end-to-end workflow for librarians and researchers: upload → OCR (Arabic-centric) → metadata extraction → semi-automatic cataloguing → search/browse.
Features
- Arabic OCR pipeline (CPU): batch processing of PDFs and images with post-processing helpers.
- Metadata extraction & suggestions: front-matter cues and keyword-based category suggestions.
- Cataloguing UI: librarian-friendly editing/validation of records.
- Search & browse: library page with full-text/filters and document detail view.
- Roles & permissions: Librarian (ingest/catalogue) and Researcher (search/consult).
Prerequisites
- Runtime (dev): Python ≥ 3.10 (venv recommended)
- Deploy: Docker + Docker Compose
- Auth: Keycloak (via D4Science, OIDC client configured)
- Database: PostgreSQL (containerized via Compose or external)
Core Python libraries (see requirements.txt for the full list): NiceGUI, EasyOCR (PyTorch, torchvision), Pillow, PyMuPDF/fitz, pdf2image, NLTK, langid, Tashaphyne + arramooz-pysqlite (Arabic NLP), APScheduler, psycopg2-binary (PostgreSQL).
Installation
- Clone this repository.
    git clone https://code-repo.d4science.org/Resilience/wp5_digital_maktaba
    cd wp5_digital_maktaba
- Create a .env file in the project root (example values below):
    # App
    APP_BASE_URL=http://localhost:8080
    SECRET_KEY=change-me
    LOG_DIR=./logs
    STORAGE_DIR=./uploads
    # Database
    DATABASE_URL=postgresql+psycopg2://USERNAME:PASSWORD@db:5432/maktaba
    # Keycloak (D4Science)
    KEYCLOAK_SERVER_URL=https://<keycloak-host>/
    KEYCLOAK_REALM=<realm>
    KEYCLOAK_CLIENT_ID=<client_id>
    KEYCLOAK_CLIENT_SECRET=<client_secret> # only if confidential
    KEYCLOAK_REDIRECT_URI=${APP_BASE_URL}/oidc/callback
    REQUIRED_ROLE=librarian
- Start the stack:
    docker compose up -d --build
- Open http://localhost:8080 and sign in via Keycloak (D4Science).
The OCR pipeline is designed for CPU; GPU is not required.
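To illustrate how the .env values relate to each other, the sketch below reads them from the environment and derives the redirect URI from the base URL. It is a minimal, hypothetical loader: the `Settings` class, the `load_settings` function, the choice of which variables are mandatory, and the fallback defaults are assumptions made for this example, not code from the repository. Only the variable names come from the example values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variable names come from the example .env; treating these six as
# mandatory is an assumption made for this sketch.
REQUIRED_VARS = (
    "APP_BASE_URL",
    "SECRET_KEY",
    "DATABASE_URL",
    "KEYCLOAK_SERVER_URL",
    "KEYCLOAK_REALM",
    "KEYCLOAK_CLIENT_ID",
)


@dataclass(frozen=True)
class Settings:
    app_base_url: str
    secret_key: str
    database_url: str
    keycloak_server_url: str
    keycloak_realm: str
    keycloak_client_id: str
    keycloak_client_secret: Optional[str]
    keycloak_redirect_uri: str
    required_role: str
    log_dir: str
    storage_dir: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build a Settings object from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))

    base_url = env["APP_BASE_URL"].rstrip("/")
    return Settings(
        app_base_url=base_url,
        secret_key=env["SECRET_KEY"],
        database_url=env["DATABASE_URL"],
        keycloak_server_url=env["KEYCLOAK_SERVER_URL"],
        keycloak_realm=env["KEYCLOAK_REALM"],
        keycloak_client_id=env["KEYCLOAK_CLIENT_ID"],
        # The client secret is only needed for confidential clients.
        keycloak_client_secret=env.get("KEYCLOAK_CLIENT_SECRET") or None,
        # Fall back to the callback path used in the example .env.
        keycloak_redirect_uri=env.get("KEYCLOAK_REDIRECT_URI")
        or f"{base_url}/oidc/callback",
        # Defaults below mirror the example values; they are assumptions.
        required_role=env.get("REQUIRED_ROLE", "librarian"),
        log_dir=env.get("LOG_DIR", "./logs"),
        storage_dir=env.get("STORAGE_DIR", "./uploads"),
    )
```

Failing fast on missing variables at startup tends to be easier to diagnose than a login or database error surfacing later inside the container.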
Usage
Input Files
Supported input formats for OCR and ingestion:
- PDF, PNG, JPG/JPEG
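A minimal sketch of how an upload could be checked against the formats listed above. The helper name `is_supported_upload` is hypothetical, and checking by file extension alone is an assumption; the application may inspect file contents instead.

```python
from pathlib import Path

# Extensions for the formats listed above (PDF, PNG, JPG/JPEG).
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def is_supported_upload(filename: str) -> bool:
    """Return True if the file extension matches a supported input format."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```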
Output
- Processed records stored in PostgreSQL (documents, metadata, categories).
- Browseable library in the Library page; search across full-text/metadata.
- Downloadable assets (when enabled) under STORAGE_DIR with structured subfolders.
Detailed Explanation
Architecture Overview
- Web UI & API: NiceGUI app.
- Processing pipeline: OCR + extraction tasks triggered on ingest; results stored in DB and surfaced to the UI.
- AuthN/Z: Keycloak (OIDC) with D4Science.
- Persistence: PostgreSQL database; uploads live on mounted storage volumes.
- Containerization: Docker Compose to orchestrate app + DB (+ optional reverse proxy).
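As a rough illustration of the AuthN/Z step, the sketch below builds the URL to which a browser would be redirected to start an OIDC authorization-code login against Keycloak. The function name `build_authorization_url` is hypothetical and this is not the code in auth.py. The endpoint path follows Keycloak's usual `/realms/<realm>/protocol/openid-connect/auth` layout; older Keycloak distributions prefix it with `/auth`, so the exact path for the D4Science instance is an assumption to verify.

```python
from urllib.parse import urlencode


def build_authorization_url(
    server_url: str, realm: str, client_id: str, redirect_uri: str, state: str
) -> str:
    """Return the OIDC authorization-code login URL for a Keycloak realm."""
    endpoint = (
        f"{server_url.rstrip('/')}/realms/{realm}/protocol/openid-connect/auth"
    )
    query = urlencode(
        {
            "client_id": client_id,
            "redirect_uri": redirect_uri,
            "response_type": "code",  # authorization-code flow
            "scope": "openid",
            "state": state,  # opaque value checked again on the callback
        }
    )
    return f"{endpoint}?{query}"
```

The arguments map onto KEYCLOAK_SERVER_URL, KEYCLOAK_REALM, KEYCLOAK_CLIENT_ID and KEYCLOAK_REDIRECT_URI from the .env file; `state` would be a fresh random value per login attempt.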
Core Components
- UI pages: homePage.py, uploadPage.py, cataloguePage.py, documentDetailPage.py, libraryPage.py, auth.py
- Language tools: languageDetection.py and helpers (Arabic NLP with Tashaphyne/arramooz, langid, NLTK).
- OCR: EasyOCR/PyTorch routine for PDF/images (with PyMuPDF/pdf2image/Pillow for I/O).
- Background jobs: scheduling/async processing (APScheduler).
- DB layer: PostgreSQL connection (psycopg2-binary); application models/tables for documents, metadata, categories, users/roles.
- Logging: console + file logs for workers (scanner.*.log).
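The console + file logging described above could look roughly like the following sketch, which uses only the standard library. The function name `get_worker_logger` is hypothetical, and reading the `*` in scanner.*.log as a worker name is an assumption made for this example.

```python
import logging
from pathlib import Path


def get_worker_logger(worker_name: str, log_dir: str = "./logs") -> logging.Logger:
    """Return a logger writing to the console and to scanner.<worker_name>.log."""
    logger = logging.getLogger(f"scanner.{worker_name}")
    if logger.handlers:
        # Already configured; avoid attaching duplicate handlers.
        return logger

    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    # Create the log directory (LOG_DIR in the .env example) if needed.
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    file_handler = logging.FileHandler(
        Path(log_dir) / f"scanner.{worker_name}.log", encoding="utf-8"
    )
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
    return logger
```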
Parsing Logic
Here “parsing logic” refers to the ingestion/OCR/metadata pipeline.
- Ingest & normalization
  - Accept PDFs/images; normalize pages.
- OCR (Arabic-centric)
  - Page-level OCR via EasyOCR (CPU) with text blocks extracted and cleaned.
- Language & metadata hints
  - Language detection; rule- or keyword-based extraction of front-matter and cues.
- Category suggestion
  - Match user-defined keyword sets (e.g., static/categoryData.txt) to propose leaf categories.
- Record creation
  - Persist document, pages and extracted fields; mark items pending librarian validation.
- Catalogue & search
  - Librarians validate/enrich the record; researchers search/browse the library.
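The category suggestion step can be sketched as a keyword count over the OCR text. This is an illustrative simplification, not the repository's implementation: the function `suggest_categories`, the in-memory `keyword_sets` mapping, and the example category names are hypothetical, and the actual format of static/categoryData.txt is not assumed here. Plain substring matching is also an assumption; real Arabic text would likely benefit from normalization or stemming first, which is not shown.

```python
from typing import Dict, Iterable, List, Tuple


def suggest_categories(
    text: str, keyword_sets: Dict[str, Iterable[str]], top_n: int = 3
) -> List[Tuple[str, int]]:
    """Rank categories by how many of their keywords occur in the text."""
    scores = []
    for category, keywords in keyword_sets.items():
        hits = sum(1 for keyword in keywords if keyword and keyword in text)
        if hits:
            scores.append((category, hits))
    # Highest hit count first; ties are broken alphabetically for stable output.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:top_n]


# Hypothetical keyword sets and OCR text, for illustration only.
example_sets = {
    "Fiqh": ["فقه", "أحكام"],
    "Hadith": ["حديث", "إسناد"],
    "Tafsir": ["تفسير"],
}
example_text = "كتاب في فقه العبادات وأحكام الصلاة مع شرح حديث"
suggestions = suggest_categories(example_text, example_sets)
```

Because the suggestions are only proposals, a record built from them would still be marked as pending until a librarian validates it.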
Contributing
Contributions are welcome.
- Fork the repository (or create a feature branch if working in the organization/Gitea).
- Create a branch:
    git checkout -b feature/<your-feature>
- Commit and push:
    git commit -m "Add <your-feature>"
    git push origin feature/<your-feature>
- Open a Pull Request. Please follow CONTRIBUTING.md (coding style, branching, testing, issue/PR templates).
License
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
Version: 1.0
Release Date: 2025-10-31
Acknowledgements
This work was supported by the PNRR project Italian Strengthening of ESFRI RI RESILIENCE (ITSERR), funded by the European Union – NextGenerationEU (CUP: B53C22001770006).
We warmly thank:
- FINCONS Group, for their technical support and collaboration during the deployment and integration phases of the platform;
- UNGUESS, for the initial UI/UX design sketches developed in Figma that inspired the current interface;
- Giovanni Sullutrone and Luca Sala, for their development and testing efforts.
The views and opinions expressed are those of the authors and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them.
For more information, please visit the ITSERR project website: https://www.itserr.it