Benchmarks for OCR on Ancient Greek Texts

ocr-greek-benchmarks.tar.bz2 (same as above, highly compressed)

OCR (on Ancient Greek) Aligner

This software aligns the output of three OCR engines and applies spell-checking to Ancient Greek texts.

Unzip the package and launch the script run. In the main panel, use the first text field, Project File, to choose the file in the subdirectory ocr-test. Check all the checkboxes (Make Error Patterns, Make Accuracy Reports, Make Alignment Reports, AdjustOcr) and click Align. The results are stored in the subdirectories of ocr-test indicated in the panel; inspect them with your preferred text editor. The output directories, which you may want to clean up before the tests, are errorpatterns, align, accuracy-reports and all the auxiliary directories whose names start with an underscore.
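The clean-up before a test run can be scripted. The following is only a minimal sketch: the function name clean_ocr_outputs is hypothetical (it is not part of the package), and it assumes that the output directories listed above are direct children of the directory you pass in (for example ocr-test).

```shell
# clean_ocr_outputs DIR
# Removes the aligner's output directories from DIR (e.g. ocr-test).
# Assumption: the directories are direct children of DIR.
clean_ocr_outputs() {
    dir=$1
    # Refuse to run on an empty or non-existent directory argument.
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        return 1
    fi
    # Named output directories; rm -rf ignores the ones that do not exist.
    rm -rf "$dir/errorpatterns" "$dir/align" "$dir/accuracy-reports"
    # Auxiliary directories whose names start with an underscore.
    for aux in "$dir"/_*/; do
        if [ -d "$aux" ]; then
            rm -rf "$aux"
        fi
    done
    return 0
}
```

After defining the function, a call such as clean_ocr_outputs ocr-test removes only those directories and leaves the project file and any other content in place. Since rm -rf is irreversible, check the path before running it.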

The old version requires the Aspell dev packages and the Aspell Dictionary (see below) to be installed.

Accuracy Evaluator without ground truth

Instructions are contained in the package.

Ancient Greek Spell Checker for OpenOffice 3.0 (Hunspell)

Download and install the extension for OpenOffice grc.oxt. The dictionary is based on Morpheus data (by Perseus Project).
Disclaimer: this is an alpha version and there are many errors that need to be fixed. If you find errors, please notify me.

Ancient Greek Spell Checker (Aspell Dictionary)

Download the Ancient Greek Aspell Dictionary based on Morpheus data (by Perseus Project).

Disclaimer: this is an alpha version and there are many errors that need to be fixed. If you find errors, please notify me.

A JNI interface for using Aspell from Java is available:
Aspell JNI Interface
Simple example for Aspell JNI Interface

Ancient Greek OCR Trainings

This page collects training images, texts and box maps for Ancient Greek OCR with tesseract, the open source OCR engine now developed at Google.

grl0-alpha-0_1.tar.gz: Tesseract language files for Ancient Greek texts with Latin bibliographical references (based on scanned pages extracted from Kaibel's edition of Athenaeus, 1st, 2nd and 3d vol.)
grl1-alpha-0_1.tar.gz: as above, but additional pages have been generated using TeubnerLSU and Porson fonts and have been clustered with the scanned pages
gr-lat-ocr-train-alpha_0_1.tar.gz: all text pages, images and box maps that have been used to generate the grl0 and grl1 trainings (you can download single files)
color-ocropus-alpha_0_1.tar.gz: colored image for training OCRopus, the state-of-the-art document analysis and OCR system

How to use the data files

If you have installed tesseract:

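TESSDATA=/usr/local/share/tessdata   # the usual location; adjust if yours differs
cp grl0-alpha-0_1.tar.gz "$TESSDATA"
cp grl1-alpha-0_1.tar.gz "$TESSDATA"
cd "$TESSDATA"                       # unpack inside the tessdata directory
tar xvzf grl0-alpha-0_1.tar.gz
tar xvzf grl1-alpha-0_1.tar.gz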

Now, you can use the new sets on your scanned pages:

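tesseract mygreekpage.tif out-textfile -l gl0

To use the second set, pass -l gl1 instead of -l gl0. The value given to -l must match the prefix of the language files unpacked into the tessdata directory.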
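To process a whole directory of scanned pages rather than a single file, the command above can be wrapped in a loop. This is a minimal sketch under stated assumptions: the function name ocr_pages is hypothetical, the pages are assumed to be .tif files in one directory, and the language code is passed as an argument so that it can match whatever prefix your installed language files use.

```shell
# ocr_pages LANG DIR
# Runs tesseract on every .tif page in DIR using the language set LANG.
# For DIR/page.tif the recognized text is written to DIR/page.txt.
ocr_pages() {
    lang=$1
    dir=$2
    for page in "$dir"/*.tif; do
        # Skip the unexpanded pattern when DIR contains no .tif files.
        [ -e "$page" ] || continue
        # tesseract appends .txt to the output base name by itself.
        tesseract "$page" "${page%.tif}" -l "$lang" || return 1
    done
    return 0
}
```

For example, ocr_pages gl0 scans would process every .tif file in a directory named scans (a placeholder) and stop at the first page on which tesseract reports an error.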

Currently, these training data have been tested only on the Teubner edition of Athenaeus.

Any feedback is extremely welcome!

Latin OCR Trainings

AUG., De civitate Dei, Venetiis 1475
Training files and documents on Latin incunabula are now available:
report-ocr-performances-on-aug-decivitatedei-venetiis-1475.pdf: report about accuracy rate
lic.tar.gz: language plug-in for tesseract on incunabula
lics4tests.tar.gz: language sets to perform tests (5-fold cross-validation)
tests.tar.gz: test reports
trainings.tar.gz: training resources

Ancient Greek Hyphenation Patterns for FOP translated from the original LaTeX/TeX files

Download el_EL.tar.gz

Aesch. Persae Treebank

Download Aesch. Persae Treebank
The archive contains the XML file to use with TIGERSearch and a complete dump of all tree images.
Details about the theoretical framework and explanations about the labels used in the treebank are provided in F. Boschetti, Saggio di Analisi linguistiche e stilistiche condotte con l'ausilio dell'elaboratore elettronico sui Persiani di Eschilo, Lille, Ph.D. Thesis, March 2005, Chapter 3: "Sintassi".

Ancient Greek Transcoder

Download the Ancient Greek Transcoder source code (with beta2unicode init file).
