| Title: | Interface to gocr Program |
|---|---|
| Description: | Wrapper functions to the gocr (Optical Character Recognition) program developed by Joerg Schulenburg (https://www-e.ovgu.de/jschulen/ocr/). |
| Authors: | Hauke Sonnenberg [aut, cre] (ORCID: <https://orcid.org/0000-0001-9134-2871>), Michael Rustler [ctb] (ORCID: <https://orcid.org/0000-0003-0647-7726>), FAKIN [fnd], MIA-CSO [fnd], Kompetenzzentrum Wasser Berlin gGmbH (KWB) [cph] |
| Maintainer: | Hauke Sonnenberg <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-24 07:57:13 UTC |
| Source: | https://github.com/KWB-R/kwb.gocr |
Create a gocr Configuration
gocrConfig( inputfile, showhelp = FALSE, outputfile = "", errorfile = "", progressfile = "", databasepath = "", outputformat = "", greylevel = 0, dustsize = -1, spacewidth = 0, verbosity = 0, limitVerbosityToChars = "", limitRecognitionToChars = "", certainty = 95, mode = 0, onlyRecogniseNumbers = FALSE )
inputfile |
-i file: read input from file (or stdin if file is a single dash) |
showhelp |
-h: show usage information |
outputfile |
-o file: send output to file instead of stdout. If ""
(default), a file "gocrOut_<basename(outputfile)>" in
|
errorfile |
-e file: send errors to file instead of stderr or to stdout if file is a dash |
progressfile |
-x file: progress output to file (file can be a file name, a fifo name or a file descriptor 1...255); this is useful for GUI developers to show the OCR progress. The file descriptor argument is only available if compiled with __USE_POSIX defined |
databasepath |
-p path: database path, that will be populated with images of learned characters. If "" (default), and a database is needed, a directory within the folder of the installed package is used |
outputformat |
-f format: output format of the recognized text (ISO8859_1 TeX HTML XML UTF8 ASCII), XML will also output position and probability data |
greylevel |
-l level: set grey level to level (0<160<=255, default: 0 for autodetect), darker pixels belong to characters, brighter pixels are interpreted as background of the input image |
dustsize |
-d size: set dust size in pixels (clusters smaller than this are removed), 0 means no clusters are removed, the default is -1 for auto detection |
spacewidth |
-s num: set spacewidth between words in units of dots (default: 0 for autodetect), wider widths are interpreted as word spaces, smaller as character spaces |
verbosity |
-v verbosity: be verbose to stderr; verbosity is a bitfield.
Use |
limitVerbosityToChars |
-c string: only verbose output of characters from string to stderr, more output is generated for all characters within the string, the |
limitRecognitionToChars |
-C string: only recognise characters from string; this is a filter function for cases where only a part of the character alphabet is of interest |
certainty |
-a certainty: set value for certainty of recognition (0..100; default: 95), characters with a higher certainty are accepted, characters with a lower certainty are treated as unknown (not recognized); set higher values, if you want to have only more certain recognized characters |
mode |
-m mode: set operational mode; mode is a bitfield (default: 0).
Use |
onlyRecogniseNumbers |
-n bool: if bool is non-zero, only recognise numbers (this is now obsolete, use -C "0123456789") |
Download gocr executable
gocrDownload( version_number = "048", overwrite = FALSE, target_dir = file.path(system.file(package = "kwb.gocr"), "extdata/gocr") )
version_number |
latest version number is "049". However, "048" was used for the development of this R package and is still available (default: "048") |
overwrite |
if TRUE, downloads the gocr executable and overwrites an existing one in target_dir, otherwise not (default: FALSE) |
target_dir |
target directory (default: file.path(system.file(package = "kwb.gocr"), "extdata/gocr")) |
downloads gocr executable to target directory and returns path
gocrDownload(version_number = "048")
## Not run:
gocrDownload(version_number = "049")
## End(Not run)
Path to gocr Executable File
gocrExePath()
Option String for gocr Call
gocrOptionString(config)
config |
gocr configuration as returned by |
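To illustrate how a configuration could map onto the command-line flags listed in the gocrConfig argument descriptions, here is a minimal base-R sketch. The helper sketchOptionString() and its flag table are hypothetical and written only for illustration; this is not the package's implementation of gocrOptionString(), whose actual output format may differ.

```r
# Hypothetical helper, not part of kwb.gocr: maps a named list of settings
# to the gocr flags named in the gocrConfig() argument descriptions.
sketchOptionString <- function(config) {
  flags <- c(
    inputfile = "-i", outputfile = "-o", errorfile = "-e",
    outputformat = "-f", greylevel = "-l", dustsize = "-d",
    spacewidth = "-s", verbosity = "-v", certainty = "-a", mode = "-m",
    limitRecognitionToChars = "-C"
  )
  # Keep only known settings, in the order of the flag table
  keep <- intersect(names(flags), names(config))
  # Drop settings that are empty strings (treated here as "not set")
  keep <- keep[vapply(config[keep], function(x) !identical(x, ""), logical(1))]
  if (length(keep) == 0L) return("")
  values <- vapply(config[keep], function(x) {
    # Wrap character values in double quotes so that paths with spaces survive
    if (is.character(x)) shQuote(x, type = "cmd") else as.character(x)
  }, character(1))
  paste(flags[keep], values, collapse = " ")
}

config <- list(inputfile = "scan.pnm", outputfile = "", certainty = 95, mode = 0)
cat(sketchOptionString(config), "\n")
```

The file name scan.pnm is a placeholder. Skipping empty strings mirrors the documented convention that "" stands for a default to be filled in later.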
Run gocr on an Image File
gocrRun( config, useBatch = TRUE, waitForBatch = TRUE, opendir = TRUE, dbg = TRUE )
config |
gocr configuration as returned by |
useBatch |
if TRUE (default), a batch file is written so that the user can reproduce the call by double-clicking the batch file in the file explorer (opens when opendir is TRUE) |
waitForBatch |
passed to |
opendir |
if TRUE (default) and if useBatch is TRUE, the directory in which the batch file is written is opened in the Windows Explorer |
dbg |
if |
(only if waitForBatch = TRUE): result of OCR as a character vector representing the recognised lines. The result vector has the attribute config containing the configuration used (the original config, with default values set where needed)
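A typical call sequence might look like the sketch below. It assumes that kwb.gocr is installed and exports gocrConfig() and gocrRun() as documented here, that the gocr executable has already been downloaded, and that the given image file is in a format gocr can read. The wrapper runOcrSketch() is a hypothetical name introduced for illustration only.

```r
# Sketch only, under the assumptions stated above; not tested against a
# real gocr installation.
runOcrSketch <- function(imagefile) {
  stopifnot(
    requireNamespace("kwb.gocr", quietly = TRUE),
    file.exists(imagefile)
  )
  config <- kwb.gocr::gocrConfig(
    inputfile = imagefile,
    limitRecognitionToChars = "0123456789",  # -C: recognise digits only
    certainty = 95                           # -a: documented default
  )
  # waitForBatch = TRUE so that the recognised lines are returned
  lines <- kwb.gocr::gocrRun(
    config, waitForBatch = TRUE, opendir = FALSE, dbg = FALSE
  )
  # The configuration actually used is attached as attribute "config"
  list(lines = lines, config = attr(lines, "config"))
}
```

Setting opendir = FALSE avoids opening a file explorer window, which is mainly useful when the function is called from a script rather than interactively.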
Value for Mode Option
optionValueMode( useDatabase = FALSE, layoutAnalysis = FALSE, doNotCompare = FALSE, doNotDivide = FALSE, doNotCorrect = FALSE, characterPacking = FALSE, extendDatabase = FALSE, switchOffEngine = FALSE )
useDatabase |
(2) use database to recognize characters which are not recognized by other algorithms (early development) |
layoutAnalysis |
(4) switching on layout analysis or zoning (development) |
doNotCompare |
(8) don't compare unrecognized characters to recognized ones |
doNotDivide |
(16) don't try to divide overlapping characters into two or three single characters |
doNotCorrect |
(32) don't do context correction |
characterPacking |
(64) character packing: before recognition starts, similar characters are searched and only one of these characters will be sent to the recognition engine (development) |
extendDatabase |
(128) extend database: prompts the user for unidentified characters and extends the database with the user's answer (128+2, early development) |
switchOffEngine |
(256) switch off the recognition engine (makes sense together with -m 2) |
http://manpages.ubuntu.com/manpages/gutsy/man1/gocr.1.html
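Because the mode is a bitfield, the value passed to -m is the sum of the bit values of all options that are switched on. The following base-R sketch shows this arithmetic with the bit values documented above. The function sketchModeValue() is a hypothetical stand-in written for illustration; it is not the package's optionValueMode() and may differ from it in detail.

```r
# Hypothetical re-implementation for illustration; bit values as documented
sketchModeValue <- function(...) {
  bits <- c(
    useDatabase = 2, layoutAnalysis = 4, doNotCompare = 8, doNotDivide = 16,
    doNotCorrect = 32, characterPacking = 64, extendDatabase = 128,
    switchOffEngine = 256
  )
  selected <- c(...)
  # Reject option names that are not in the table above
  stopifnot(all(names(selected) %in% names(bits)))
  # Sum the bit values of all options that are TRUE
  sum(bits[names(selected)[as.logical(selected)]])
}

# Extending the database also needs the database bit: 128 + 2
sketchModeValue(useDatabase = TRUE, extendDatabase = TRUE)
```

Since every option owns a distinct power of two, each sum corresponds to exactly one combination of options.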
Value for Option Verbosity
optionValueVerbosity( printMore = 1, listShapes = 1, listPattern = 1, printPattern = 1, printDebug = 1, createOutPng = 0 )
printMore |
(1) print more info |
listShapes |
(2) list shapes of boxes (see -c) to stderr |
listPattern |
(4) list pattern of boxes (see -c) to stderr |
printPattern |
(8) print pattern after recognition for debugging |
printDebug |
(16) print debug information about recognition of lines to stderr |
createOutPng |
(32) create outXX.png with boxes and lines marked on each general OCR-step |
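The verbosity value is a bitfield as well, so a given -v value can be decoded back into the flags it contains. The sketch below uses the bit values documented above. The names verbosityBits and decodeVerbosity() are hypothetical helpers introduced for illustration and are not part of the package.

```r
# Bit values as documented for the -v option
verbosityBits <- c(
  printMore = 1L, listShapes = 2L, listPattern = 4L,
  printPattern = 8L, printDebug = 16L, createOutPng = 32L
)

# Hypothetical helper: which flags are set in a given verbosity value?
decodeVerbosity <- function(value) {
  # Compare the value against every bit; a non-zero AND means "bit is set"
  value <- rep(as.integer(value), length(verbosityBits))
  names(verbosityBits)[bitwAnd(value, verbosityBits) != 0L]
}

# Compose a value from two flags, then decode it again
value <- sum(verbosityBits[c("printMore", "printDebug")])
decodeVerbosity(value)
```

Decoding is handy when a verbosity value appears in a saved configuration or batch file and it is unclear which diagnostics it switches on.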