Defining an OCR Value Source

You can extract text or barcodes from a scanned document using optical character recognition (OCR) and use them as automatic property values for files imported from an external source, a scanner in this case. The OCR value source is a zone defined on a scanned page. For more information on defining different properties for objects imported from external file sources, see Defining Metadata for an External File Source.

An OCR value source can only be used with an external source. It cannot be defined in M-Files Desktop.

Note: The M-Files OCR module is an M-Files add-on product available for an extra fee. It is activated with a license code: the old license code must be replaced with a license code that enables the use of OCR. For more information, refer to License Management. To enable OCR, you also need to download and install some additional files on your M-Files Server (for further information, contact our customer support). The OCR-related functions are then available in M-Files Admin and M-Files Desktop. M-Files uses an OCR engine offered by I.R.I.S. For purchase inquiries about the M-Files OCR module, please contact our sales team at [email protected].
Note: You can use the OCR value source without enabling the Use OCR to enable full-text search of scanned documents option in the Searchable PDF tab.

Complete the following steps to define an OCR value source:

Steps

  1. Open M-Files Admin.
  2. In the left-side tree view, expand the desired connection to M-Files Server.
  3. In the left-side tree view, expand the document vault of your choice.
  4. Still in the left-side tree view, expand Connections to External Sources and then select File Sources.
  5. On the File Sources list, double-click the file source that you want to edit.
    The Connection Properties dialog is opened.
  6. Click the Metadata tab.
    The Metadata tab is opened.
  7. Click Add... to define a new property and value to be added automatically for objects created from external files, or select one of the existing properties and click Edit... to edit the existing property.
    The Define Property dialog is opened.
  8. Select the option Use an OCR value source and click the Define... button.
    The OCR Value Source Definition dialog is opened.
  9. In the Zone type section, select either:
    1. Text: Select this option if the OCR zone contains text.
      or
    2. Barcode: Select this option if the OCR zone contains a barcode.
      Note: M-Files recognizes most of the 1D barcodes in use and two types of 2D barcode: PDF417 and QR Code. If you are using an OCR-enabled license code that was delivered before version 9.0, please ask our customer service to provide you with a new license code if you want to use barcode recognition.
  10. In the Zone position section, define a zone from which to extract a value for the selected property. The characters may include any letters, numbers or punctuation marks. For example, an invoice number shown on a page can be added as the Invoice number property value for the scanned document.
    An example of a zone definition:
    If you are capturing a barcode and there is only one barcode to recognize on the page, you can specify the whole page as the zone. If there are several barcodes, restrict the zone in such a way that it contains only the desired barcode. With QR codes, you should specify a zone larger than the actual barcode. If the specified zone contains several barcodes, all of them are considered to be a property value.
    1. In the Page field, enter the page number of the scanned document that you want to use as the OCR value source.
    2. Using the Unit options, select the appropriate unit for defining the zone position.
    3. In the Left field, enter the position of the left edge of the OCR zone. The left edge of the scanned document is considered "0".
    4. In the Right field, enter the position of the right edge of the OCR zone.
    5. In the Top field, enter the position of the top edge of the OCR zone. The top edge of the scanned document is considered "0".
    6. In the Bottom field, enter the position of the bottom edge of the OCR zone.
  11. Using the Primary language and Secondary language drop-down menus, select the primary and secondary language of the documents scanned via this external connection in order to improve the quality of the recognition results. The list of secondary languages only contains languages that are allowed to be used with the selected primary language.
    Although the OCR automatically recognizes all Western languages and Cyrillic character sets, specifying the language often improves the quality of the text recognition results. In ambiguous cases, a problematic recognition result may be resolved by a language-specific factor, such as recognition of the letter 'Ä' in Finnish.
  12. Click OK to close the OCR Value Source Definition dialog.
  13. Back in the Define Property dialog, select either:
    1. Use the value read as the ID of the item: Select this option if you want to use the captured value as an identifier of the value list item with a separately defined name.
      or
    2. Use the value read as the name of the item: Select this option if you want to use the captured value as the name of the value list item. You can select the Add a new item to the list if a matching item is not found check box if you want a new value list item to be added whenever a new value is captured.
  14. Click OK to close the Define Property dialog.
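
The zone position defined in step 10 can be pictured as a rectangle measured from the top-left corner of the scanned page. The following Python sketch is only an illustration of that geometry and is not part of M-Files: the OcrZone class, its field names, the unit names "mm" and "inch", and the assumption that page numbering starts at 1 are all hypothetical. It checks that the four edge positions are consistent and converts them to pixel coordinates for a given scan resolution.

```python
from dataclasses import dataclass

MM_PER_INCH = 25.4


@dataclass(frozen=True)
class OcrZone:
    """Hypothetical model of the Zone position fields (not an M-Files API)."""

    page: int
    left: float
    right: float
    top: float
    bottom: float
    unit: str = "mm"  # assumed unit names: "mm" or "inch"

    def validate(self) -> None:
        """Raise ValueError if the zone cannot describe a rectangle on a page."""
        if self.unit not in ("mm", "inch"):
            raise ValueError("unit must be 'mm' or 'inch'")
        if self.page < 1:  # assumption: the first page is page 1
            raise ValueError("page must be 1 or greater")
        if not 0 <= self.left < self.right:
            raise ValueError("left must be >= 0 and smaller than right")
        if not 0 <= self.top < self.bottom:
            raise ValueError("top must be >= 0 and smaller than bottom")

    def to_pixels(self, dpi: int) -> tuple[int, int, int, int]:
        """Return (left, top, right, bottom) in pixels for a scan at `dpi`."""
        self.validate()
        divisor = MM_PER_INCH if self.unit == "mm" else 1.0
        left, top, right, bottom = (
            round(value / divisor * dpi)
            for value in (self.left, self.top, self.right, self.bottom)
        )
        return (left, top, right, bottom)


# A zone one inch wide and one inch tall, starting one inch from the left edge.
zone = OcrZone(page=1, left=25.4, right=50.8, top=0.0, bottom=25.4)
print(zone.to_pixels(300))  # (300, 0, 600, 300)
```

A check of this kind is useful when planning zones on paper before entering them in the dialog, because a zone whose right edge is not greater than its left edge, or whose bottom edge is not greater than its top edge, cannot contain any content.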

Results

The zone you have just defined is used to automatically extract a value for the selected property using OCR whenever a new object is created via the selected external file source.
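
The two options in step 13 differ in how the captured value is matched against the value list. The sketch below illustrates that difference in Python under stated assumptions: the value list is a plain dictionary mapping item IDs to item names, the resolve_item function and its parameters are hypothetical, the "auto-" ID scheme for new items is made up for the example, and returning None when nothing matches is a simplification, since this section does not describe what M-Files does in that case.

```python
from typing import Optional


def resolve_item(
    value_list: dict[str, str],
    captured: str,
    *,
    match_by: str,
    add_if_missing: bool = False,
) -> Optional[str]:
    """Return the ID of the value list item matching the captured value.

    `value_list` maps item ID -> item name (a hypothetical stand-in for a
    real value list). `match_by` is "id" or "name".
    """
    captured = captured.strip()
    if match_by == "id":
        # The captured value is the identifier; the name is defined separately.
        return captured if captured in value_list else None
    if match_by == "name":
        for item_id, name in value_list.items():
            if name == captured:
                return item_id
        if add_if_missing:
            # Made-up ID scheme, used only for this illustration.
            new_id = f"auto-{len(value_list) + 1}"
            value_list[new_id] = captured
            return new_id
        return None
    raise ValueError("match_by must be 'id' or 'name'")


customers = {"100": "Acme Ltd", "200": "Globex"}
print(resolve_item(customers, "100", match_by="id"))       # 100
print(resolve_item(customers, "Globex", match_by="name"))  # 200
```

Matching by ID suits barcodes that encode a customer or project number, while matching by name suits zones that contain readable text such as a company name.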

What to do next

To ensure that the defined zone is correctly positioned, in most cases the document to be scanned should be placed onto the scanner glass by hand rather than fed via an automatic sheet feeder.

In some cases, the OCR may recognize text incorrectly: for example, depending on the font type or size, the number 1 may be interpreted as the letter I. To ensure that the characters are added correctly to the document metadata, you can check the property values with event handlers and VBScript. For example, you can use VBScript to check that all added characters are numbers. For more information, see Event Handlers.
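
The check described above would be written in VBScript inside an M-Files event handler. The following Python sketch shows only the validation logic, not the event handler itself; the function names are hypothetical. It tests whether a captured value consists solely of the digits 0 to 9 and reports the position of each character that does not.

```python
def is_all_digits(value: str) -> bool:
    """Return True only if the value is non-empty and contains just 0-9."""
    return value.isascii() and value.isdigit()


def non_digit_positions(value: str) -> list[tuple[int, str]]:
    """Return (index, character) for every character that is not 0-9."""
    return [
        (index, char)
        for index, char in enumerate(value)
        if not (char.isascii() and char.isdigit())
    ]


print(is_all_digits("12345"))        # True
print(is_all_digits("I2345"))        # False
print(non_digit_positions("I2345"))  # [(0, 'I')]
```

Reporting the offending positions makes it easier to see whether a failure is a typical lookalike substitution, such as the letter I in place of the number 1, or a badly positioned zone that captured unrelated text.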