Full-text Transcription with Shundlikht

Shundlikht is an open source Python program (Jupyter Notebook) that, once run locally, automatically transcribes, translates (if you want), and collates the pdf images associated with each installment of a given work in the database.  The images associated with each installment are hosted by the National Library of Israel (NLI).

Data moves through the Shundlikht notebook in four phases:
1. The csv file output associated with the “Export search results” function button on represents the input data, and, from that document, individual installments making up a target work are automatically downloaded to the local user’s working directory.
2. Using a modified version of the sample code snippet available from Google, we upload “raw” pdf images to the Google’s Cloud Translate API, returning Unicode results in the target language and type.  With more than 130 languages supported, this is a powerful interchange.
3. PyPDF’s “extract text” function is then employed, to extract and order plaintext transcriptions from the Google Cloud pdf outputs, writing text documents to the local user’s working directory.
4. Finally, a simple glob script collates the plaintext transcriptions, combining individual installments to form a single “bag-of-words” representing the full text of the target work.

Yiddish transcription remains an unsolved technical challenge. Shundlikht relies on proprietary ML models from Google to generate transcriptions, for example. Moreover, varied input data – spanning history, media, and format – (and OCR software localization limitations) have prevented automated techniques from achieving high-accuracy transcriptions at scale. Yet there is great progress in this space.
Both core software packages support right-to-left language localization out of the box, and multiple, international groups of scholars – including the DiJest and Jochre teams - are encouraging workflows for generating full-text transcriptions from historical Yiddish publications.  Each effort has its merits and the team is in constant conversation with these groups, meeting regularly to share our respective approaches to processing these rich but heretofore uncompiled corpora. We’re using Google in this case because of the immediate ability to offer rough translation, thereby opening the corpus, at least partially, to scholars without Yiddish knowledge and enabling inputting the data into newer applications that don’t yet support Hebrew/Yiddish texts.

For now, Shundlikht requires some knowledge of Python, and, since they rely on third-party transcription and extraction software (Google and PyPDF respectively), column ordering within a single page is difficult to parse. Thus, what we have is a bag of words rather than ordered text whose stylistics could be fully analyzed. Caveats aside, researchers now have the opportunity to engage with the full text contents of more than 9,400 individual Yiddish works not included in book collections like the Yiddish Book Center and differentiated from the mass of untagged text that makes up the JPress archive.

by Matt Cook

Published on Friday, August 11, 2023