VietOCR is a Java GUI frontend for the Tesseract OCR engine, providing character recognition support for common image formats and multi-page images. The program includes a postprocessing step that helps correct errors regularly encountered in the OCR process, improving the accuracy of the result. The program can also run as a console application from the command line.
Batch processing is now supported. The program monitors a watch folder for new image files, automatically processes them through the OCR engine, and outputs recognition results to an output folder.
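The watch-folder behaviour described above can be pictured with the standard java.nio.file.WatchService API. This is only a minimal sketch and not VietOCR's actual implementation: the class name, the list of image extensions, and the ".txt" output naming are assumptions made for illustration, and the OCR call itself is left as a placeholder.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a watch-folder loop; not taken from VietOCR's source.
class WatchFolderSketch {
    // Assumed set of file extensions to treat as images.
    static final Set<String> IMAGE_EXTENSIONS = new HashSet<>(
            Arrays.asList("tif", "tiff", "png", "jpg", "jpeg", "gif", "bmp"));

    // Returns true if the file name ends with one of the assumed image extensions.
    static boolean isImage(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        return dot >= 0 && IMAGE_EXTENSIONS.contains(name.substring(dot + 1));
    }

    // Maps e.g. watch/page1.png to <outputDir>/page1.txt (assumed naming scheme).
    static Path outputFor(Path image, Path outputDir) {
        String name = image.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return outputDir.resolve(base + ".txt");
    }

    // Blocks and reports every new image file that appears in watchDir.
    static void watch(Path watchDir, Path outputDir) throws IOException, InterruptedException {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            watchDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take(); // waits until at least one event is queued
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        continue; // events were lost; nothing to resolve
                    }
                    Path created = watchDir.resolve((Path) event.context());
                    if (isImage(created)) {
                        // Placeholder: a real implementation would run OCR here.
                        System.out.println("Would OCR " + created
                                + " -> " + outputFor(created, outputDir));
                    }
                }
                if (!key.reset()) {
                    break; // the watched directory is no longer accessible
                }
            }
        }
    }
}
```

Calling WatchFolderSketch.watch(Paths.get("watch"), Paths.get("output")) would block and print one line for each image dropped into the watch directory; both folder names are examples only.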
Java Runtime Environment 8 or later is required. On Windows, the Microsoft Visual C++ 2022 Redistributable Package is also required.
Tesseract Windows executable is bundled with the program. Additional language data packs for Tesseract, whose names start with ISO639-3 codes, should be placed into the tessdata subdirectory.
On Linux, Tesseract and its language packs are available in the Graphics (universe) repository. They can be installed using Synaptic or with the following command:
sudo apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-vie
The files will be placed in /usr/bin and /usr/share/tesseract-ocr/tessdata, respectively. On the other hand, if Tesseract is built and installed from the source, they will be placed in /usr/local/bin and /usr/local/share/tessdata.
You can also let VietOCR know the location of tessdata via the environment variable TESSDATA_PREFIX:
export TESSDATA_PREFIX=/usr/local/share/
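As an illustration of how a program might consume that variable, here is a minimal sketch. It assumes, based only on the export example above, that TESSDATA_PREFIX names the parent directory of tessdata; check this against the Tesseract version in use. The class name and the fallback path are hypothetical choices for the example and do not describe VietOCR's actual lookup logic.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; shows one way to resolve the tessdata folder.
class TessdataLocator {
    // Fallback taken from the package-install location mentioned above.
    static final Path DEFAULT_TESSDATA = Paths.get("/usr/share/tesseract-ocr/tessdata");

    // Assumption: the prefix is the PARENT of tessdata, as in the export example.
    static Path resolve(String prefix) {
        if (prefix == null || prefix.trim().isEmpty()) {
            return DEFAULT_TESSDATA;
        }
        return Paths.get(prefix.trim(), "tessdata");
    }

    public static void main(String[] args) {
        // Prints the folder that would be searched for language data.
        System.out.println(resolve(System.getenv("TESSDATA_PREFIX")));
    }
}
```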
For other platforms, please consult the Tesseract wiki page.
VietOCR also supports downloading and installing language packs via the Download Language Data menu item. Depending on the location of the tessdata folder, you may be asked to run the program as root or administrator so that the downloaded data can be installed, if the folder is in a system location such as /usr on Linux or C:\Program Files on Windows.
Scanning support on Windows is provided via the Windows Image Acquisition Library v2.0.
On Linux, scanning requires installation of the SANE packages:
sudo apt-get install libsane sane sane-utils libsane-extras xsane
PDF support is possible via PDFBox.
Spellcheck functionality is available through Hunspell, whose dictionary files (.aff, .dic) should be placed in the dict folder of VietOCR. user.dic is a UTF-8-encoded file which contains a list of custom words, one word per line.
On Linux, Hunspell and its dictionaries can be installed with Synaptic or apt, as follows:
sudo apt-get install hunspell hunspell-ca
To run the program:
java -jar VietOCR.jar
Note: If you encounter an out-of-memory exception, run the ocr script file instead of using the .jar.
The Vietnamese language data were generated for the Times New Roman, Arial, Verdana, and Courier New fonts. Recognition therefore has a better success rate on images with similar font glyphs. OCRing images whose font glyphs look different from the supported fonts will generally require training Tesseract to create another language data pack specifically for those typefaces. Language data for some VNI and TCVN3 (ABC) fonts have also been bundled in the latest versions.
Images to be OCRed should be scanned at a resolution between 200 and 400 DPI (dots per inch) in monochrome (black and white) or grayscale. Scanning at higher resolutions will not necessarily result in better recognition accuracy, which currently can be higher than 97% for Vietnamese and may improve further with the next release of Tesseract. Even so, the actual rates still depend greatly on the quality of the scanned image. Typical scanning settings are 300 DPI at 1 bpp (bit per pixel) black and white or 8 bpp grayscale, saved in uncompressed TIFF or PNG format.
The Screenshot Mode offers better recognition rates for low-resolution images, such as screen prints, by rescaling them to 300 DPI.
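The rescaling step amounts to multiplying the pixel dimensions by 300 divided by the source resolution. The sketch below shows that arithmetic with the standard java.awt imaging classes. It is not VietOCR's own code: the class and method names are hypothetical, bicubic interpolation is an assumed choice, and the source DPI must be supplied by the caller (96 DPI is a common screen value, used here only as an example).

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical sketch of upscaling a low-resolution image to 300 DPI.
class RescaleSketch {
    static final double TARGET_DPI = 300.0;

    // Scales the image so that its nominal resolution becomes 300 DPI.
    static BufferedImage rescaleTo300Dpi(BufferedImage src, double sourceDpi) {
        double scale = TARGET_DPI / sourceDpi; // e.g. 300 / 96 = 3.125
        int width = Math.max(1, (int) Math.round(src.getWidth() * scale));
        int height = Math.max(1, (int) Math.round(src.getHeight() * scale));

        BufferedImage dst = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Assumed interpolation choice; smoother than nearest-neighbour.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        BufferedImage shot = new BufferedImage(96, 48, BufferedImage.TYPE_INT_RGB);
        BufferedImage scaled = rescaleTo300Dpi(shot, 96.0);
        System.out.println(scaled.getWidth() + " x " + scaled.getHeight());
    }
}
```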
In addition to the built-in text postprocessing algorithm, you can add your own custom text replacement scheme via a UTF-8-encoded tab-delimited text file named x.DangAmbigs.txt, where x is the ISO639-3 language code. Both plain and Regex text replacements are supported.
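To make the idea of a tab-delimited replacement file concrete, here is a minimal sketch. It assumes each line has the form "find, TAB, replacement"; the exact column layout and the way VietOCR tells plain rules from Regex rules are not described here, so the sketch simply exposes the two modes as separate methods. All class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of applying rules from a tab-delimited replacement file.
class ReplacementSketch {
    // Reads the file as UTF-8 and parses it into ordered find/replace pairs.
    static Map<String, String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    // Assumed line format: "<find>\t<replacement>"; other lines are skipped.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> rules = new LinkedHashMap<>(); // keeps file order
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue; // blank line or no tab separator
            }
            rules.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return rules;
    }

    // Plain mode: every literal occurrence of the key is replaced.
    static String applyPlain(String text, Map<String, String> rules) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            text = text.replace(rule.getKey(), rule.getValue());
        }
        return text;
    }

    // Regex mode: the key is a java.util.regex pattern, the value may use $1 etc.
    static String applyRegex(String text, Map<String, String> rules) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            text = text.replaceAll(rule.getKey(), rule.getValue());
        }
        return text;
    }
}
```

With the plain rules "h0a → hoa" and "1a → la", applyPlain turns "h0a 1a" into "hoa la". A Regex rule such as `\bl(\d)` with replacement `1$1` would rewrite a lowercase "l" that precedes a digit as the digit "1".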
You can put init-only and non-init control parameters in tessdata/configs/tess_configs and tess_configvars files, respectively, to modify Tesseract's behaviour.
Some built-in tools are provided to merge several images or PDF files into a single one for convenient OCR operations, or to split a TIFF or PDF file into smaller ones if it contains too many pages, which can cause out-of-memory exceptions.
The recognition errors can generally be classified into three categories. Many of the errors are related to letter case (for example, hOa, nhắC) and can be easily corrected by popular Unicode text editors. Many other errors are a result of the OCR process, such as missing diacritical marks or wrong letters with similar shapes: huu for hưu, marg for mang, h0a for hoa, 1a for la, uhìu for nhìn. These can also be easily fixed by spell checker programs. The built-in Postprocessing function can help correct many of the aforementioned errors.
The last category of errors is the most difficult to detect because they are semantic errors: the words are valid dictionary entries but are wrong in context (e.g., tinh for tình, vân for vấn). These errors require the editor to read through the text and manually correct them against the original image.
Following are instructions on how to correct the first two categories of OCR errors using the built-in functionality:
Through the above process, most common errors can be eliminated. The remaining semantic errors are few, but they require a human editor to read through the text and make the necessary edits so that the document matches the original scanned document and, if desired, is error-free.
If you have any questions, please post them in the VietOCR forum.