Description
Manual cataloging within the Koha Integrated Library System remains a labor-intensive task, often slowed by repetitive data entry and the persistent risk of human error. This paper introduces a native Koha tool designed to automate the extraction of MARC21 metadata directly from images of book title pages. A librarian uploads a photo of the title page, and the system identifies and populates essential fields such as Author (100), Title (245), Publication Information (260/264), and ISBN (020).
To make data extraction from real-world photos more reliable, the system runs each image through a preprocessing pipeline before recognition. ImageMagick first corrects improper orientation and applies local adaptive thresholding to compensate for shadows and uneven lighting; the cleaned image is then passed to Tesseract OCR.
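The preprocessing and OCR steps described above could be wired together roughly as follows. This is a minimal sketch, not the tool's actual code: the function names, the 25x25 window, the 10% offset, the `eng` language, and page segmentation mode 6 are all illustrative assumptions, and the `magick` binary name assumes ImageMagick 7 (older installs use `convert`).

```python
import subprocess
from pathlib import Path


def preprocess_command(src, dst, window=25, offset_pct=10, binary="magick"):
    """Build an ImageMagick command line (hypothetical parameter choices).

    -auto-orient applies the EXIF orientation, -colorspace Gray drops color,
    and -lat performs local adaptive thresholding over a window x window area.
    """
    return [
        binary, str(src),
        "-auto-orient",
        "-colorspace", "Gray",
        "-lat", f"{window}x{window}+{offset_pct}%",
        str(dst),
    ]


def ocr_command(image, lang="eng", psm=6):
    """Build a Tesseract command line that writes recognized text to stdout."""
    return ["tesseract", str(image), "stdout", "-l", lang, "--psm", str(psm)]


def title_page_to_text(src, workdir):
    """Run both steps; requires ImageMagick and Tesseract to be installed."""
    cleaned = Path(workdir) / "cleaned.png"
    subprocess.run(preprocess_command(src, cleaned), check=True)
    result = subprocess.run(
        ocr_command(cleaned), check=True, capture_output=True, text=True
    )
    return result.stdout
```

Keeping command construction separate from execution makes the thresholding parameters easy to tune per collection without touching the code that runs the tools.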
A key innovation of this research is the transition from traditional regex-based parsing and external API dependencies toward the use of locally deployed Large Language Models (LLMs), such as Qwen. By processing raw OCR text through a local LLM, the system can "read between the lines" to reconstruct fragmented titles and organize bibliographic data into structured, MARC-compatible JSON. This context-aware approach significantly improves accuracy when dealing with noisy data while maintaining data privacy by keeping the entire workflow on local hardware.
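To illustrate the step from LLM output to MARC-compatible data, the sketch below parses a JSON reply and maps it onto the fields named earlier. The JSON keys (`author`, `title`, `place`, `publisher`, `year`, `isbn`), the function names, and the indicator values are assumptions made for this example; the abstract does not specify the tool's schema, and the sample record is invented.

```python
import json


def parse_llm_reply(reply):
    """Extract the first JSON object from a reply that may include extra prose."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in LLM reply")
    return json.loads(reply[start:end + 1])


def to_marc_fields(record):
    """Map a flat record (hypothetical keys) to MARC21-style field dicts."""
    fields = []
    if record.get("isbn"):
        isbn = str(record["isbn"]).replace("-", "")
        fields.append({"tag": "020", "ind1": " ", "ind2": " ",
                       "subfields": [("a", isbn)]})
    if record.get("author"):
        # ind1 "1" assumes a "Surname, Forename" personal name.
        fields.append({"tag": "100", "ind1": "1", "ind2": " ",
                       "subfields": [("a", record["author"])]})
    if record.get("title"):
        # ind1 "1" when a main entry (100) is present; ind2 "0" assumes
        # no leading article to skip when filing.
        ind1 = "1" if record.get("author") else "0"
        fields.append({"tag": "245", "ind1": ind1, "ind2": "0",
                       "subfields": [("a", record["title"])]})
    publication = [
        (code, str(record[key]))
        for code, key in (("a", "place"), ("b", "publisher"), ("c", "year"))
        if record.get(key)
    ]
    if publication:
        # 264 with ind2 "1" marks a publication statement.
        fields.append({"tag": "264", "ind1": " ", "ind2": "1",
                       "subfields": publication})
    return fields


# Invented sample reply, for illustration only.
sample_reply = (
    'Here is the record:\n'
    '{"author": "Doe, Jane", "title": "An example title", "place": "Pune", '
    '"publisher": "Example Press", "year": 2020, "isbn": "000-0-00-000000-0"}'
)
fields = to_marc_fields(parse_llm_reply(sample_reply))
```

Because the output is a plain list of tagged fields, the librarian can still review and correct each value before it is saved, which matches the verification step the workflow relies on.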
Ultimately, this tool transforms cataloging from a manual "type-everything" chore into a streamlined "photo-to-verification" model. The result is a faster, more efficient workflow that paves the way for a truly AI-augmented library environment.
| Duration of your presentation (in minutes) | 15 |
|---|---|