Tools for text digitisation

More than
state-of-the-art tools for text digitisation.

284 results



  • Description: authorship attribution software
  • Group: Text Processing
  • Type: -
  • Subtype: authorship attribution
  • License: GPL?
  • Language: -
  • Developer: Evaluating Variation in Language Laboratory


  • Description: The JHOVE2 project generalizes the concept of format characterization to include identification, validation, feature extraction, and policy-based assessment.
  • Group: text processing
  • Type: language resources
  • Subtype: discovery interface
  • License:
  • Language: n/a
  • Developer:


  • Description: This OCR engine is implemented as a Java library along with a demo application which shows the library in action. The core concept at the character level is image matching with automatic position and aspect ratio correction using a least-square-error matching algorithm. It is a very simple yet reasonably effective implementation.
  • Group: Text Recognition
  • Type: Core Text Recognition
  • Subtype: -
  • License: BSD
  • Language: -
  • Developer: -
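The character-level matching described in this entry can be illustrated with a minimal sketch. It is not the library's code (the library itself is written in Java). It assumes binary glyph images stored as nested lists, approximates position correction by cropping to the inked area, approximates aspect ratio correction by nearest-neighbour rescaling to a fixed grid, and then picks the template with the smallest sum of squared pixel differences. All function names and the two toy templates are hypothetical.

```python
# Illustrative sketch only: names and templates are hypothetical,
# not taken from the OCR library described above.

def crop_to_ink(glyph):
    """Trim empty border rows and columns so the glyph's position is irrelevant."""
    rows = [r for r, row in enumerate(glyph) if any(row)]
    cols = [c for c in range(len(glyph[0])) if any(row[c] for row in glyph)]
    if not rows:
        return glyph  # blank image: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in glyph[rows[0]:rows[-1] + 1]]


def resize_nearest(glyph, height, width):
    """Rescale to a fixed grid by nearest-neighbour sampling (aspect ratio is normalised away)."""
    src_h, src_w = len(glyph), len(glyph[0])
    return [
        [glyph[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def squared_error(a, b):
    """Sum of squared pixel differences between two equally sized images."""
    return sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def classify(glyph, templates, size=(8, 8)):
    """Return the label of the template with the least squared error."""
    normalised = resize_nearest(crop_to_ink(glyph), *size)
    scores = {
        label: squared_error(normalised, resize_nearest(crop_to_ink(t), *size))
        for label, t in templates.items()
    }
    return min(scores, key=scores.get)


# Toy templates (assumed for illustration).
TEMPLATES = {
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}
```

Calling `classify(scanned_glyph, TEMPLATES)` on a larger, shifted copy of one of the templates should return that template's label, because cropping and rescaling bring both images onto the same grid before the error is computed.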


  • Description: A tool that makes it possible to transform metadata from a traditional XML-based schema to RDF/OWL. Mappings are described with XML. Existing mappings used in SYNAT transform traditional library/museum formats to the CIDOC CRM/FRBRoo ontology.
  • Group: metadata processing
  • Type: -
  • Subtype: Format transformation (XML)
  • License:
  • Language: -
  • Developer: Poznań Supercomputing and Networking Center


  • Description: An omnifont OCR software for KDE. Because each step of the OCR process can be visualized, you can quickly get an idea of how OCR works and where the problems lie. However, the program may be of little or no use to end users in its current state.
  • Group: Text Recognition
  • Type: Core Text Recognition
  • Subtype: -
  • License: GPLv2
  • Language: -
  • Developer: -


  • Description: LX-Parser is a statistical constituency parser for Portuguese. It performs a syntactic analysis of Portuguese sentences in terms of their constituency structure.
  • Group: Text Processing
  • Type: NLP Tools
  • Subtype: Parser
  • License: Free
  • Language: Portuguese
  • Developer:


  • Description: LX-Tagger is a part-of-speech tagger for Portuguese that assigns a single morpho-syntactic tag from its tagset to every token.
  • Group: Text Processing
  • Type: NLP Tools
  • Subtype: POS tagger
  • License: Proprietary
  • Language: Portuguese
  • Developer:


  • Description: The LemmaGen project aims to provide a standardized open source multilingual platform for lemmatisation. We started this work because of the lack of a high-quality lemmatiser for the Slovene language. We now have lemmatisers not only for Slovene but also for 11 other European languages, as well as a system that can learn lemmatisation rules for new languages from existing wordform-lemma pair examples.
  • Group: Text Processing
  • Type: NLP Tools
  • Subtype: Stemmer/Lemmatizer
  • License: free open source
  • Language: Slovene, 11 more
  • Developer:
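To give a rough idea of what learning lemmatisation rules from wordform-lemma pairs can look like, here is a minimal sketch. It is not LemmaGen's algorithm or API. It assumes a much simpler scheme in which each pair yields one suffix-rewrite rule, the most frequent rewrite per suffix wins, and the longest matching suffix is applied at lookup time. All names and the English training pairs are hypothetical.

```python
from collections import Counter, defaultdict

# Illustrative sketch only: a simplified suffix-rule learner,
# not the method implemented in LemmaGen.


def suffix_rule(wordform, lemma):
    """Return (suffix_to_strip, suffix_to_add) after removing the shared prefix."""
    i = 0
    while i < min(len(wordform), len(lemma)) and wordform[i] == lemma[i]:
        i += 1
    return wordform[i:], lemma[i:]


def learn_rules(pairs):
    """Map each stripped suffix to its most frequently observed replacement."""
    counts = defaultdict(Counter)
    for wordform, lemma in pairs:
        strip, add = suffix_rule(wordform, lemma)
        counts[strip][add] += 1
    return {strip: adds.most_common(1)[0][0] for strip, adds in counts.items()}


def lemmatise(word, rules):
    """Apply the rule for the longest matching non-empty suffix, else return the word unchanged."""
    for start in range(len(word)):
        suffix = word[start:]
        if suffix in rules:
            return word[:start] + rules[suffix]
    return word


# Hypothetical training pairs (wordform, lemma).
TRAINING_PAIRS = [
    ("walked", "walk"),
    ("jumped", "jump"),
    ("cities", "city"),
    ("parties", "party"),
]

RULES = learn_rules(TRAINING_PAIRS)
```

With these pairs, the learner derives the rules "ed" to "" and "ies" to "y", so `lemmatise("talked", RULES)` strips the "ed" suffix. A production system would need context-sensitive rules and exception handling that this sketch leaves out.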


  • Description: Leptonica is a pedagogically-oriented open source site containing software that is broadly useful for image processing and image analysis applications.
  • Group: Image processing
  • Type: Image Processing and Enhancement
  • Subtype: toolbox
  • License: Own license (similar to ASL)
  • Language: -
  • Developer: Dan Bloomberg


  • Description: For many applications it is important to be able to correctly identify the language that a document or piece of text is written in. The Lextek Language Identifier enables you to do this. Since some languages may be written in several character encodings, the Lextek Language Identifier will also automatically identify which character encoding the text was written in. Supporting approximately 260 different languages and character encodings, the Lextek Language Identifier gives you the ability to automatically recognize more languages and encodings than any other language identifier available. We are adding more languages all the time and work closely with our customers to ensure that their language recognition needs are fully supported.
  • Group: Text Processing
  • Type: NLP Tools
  • Subtype: Language Identification
  • License: commercial
  • Language: 260
  • Developer:


  • Description: The Virtual Lightbox is a software tool for comparing images online.
  • Group: Miscellaneous Utilities
  • Type: -
  • Subtype: image comparison
  • License: GPL
  • Language: -
  • Developer: University of Maryland

Liner2 (NER)

  • Description: Liner2 is a customizable and open-source framework for the recognition of proper names. The framework consists of several universal methods for sequence chunking, which include dictionary look-up, pattern matching, and statistical processing. The statistical processing is performed using Conditional Random Fields and a rich set of features including morphological, lexical, and semantic information. We present an application of the framework to the task of recognizing proper names in Polish texts (5 common categories of proper names, i.e. first names, surnames, city names, road names, and country names) and an extended model to recognize 56 categories of proper names, which was used to bootstrap the manual annotation of the KPWr corpus.
  • Group: Text Processing
  • Type: -
  • Subtype: NER
  • License: unknown
  • Language: Polish
  • Developer: The WrocUT Language Technology Group G4.19
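Of the chunking methods listed for Liner2, dictionary look-up is the simplest to illustrate. The sketch below is not Liner2 code. It assumes a greedy longest-match strategy over whitespace tokens and a small hand-made gazetteer. The function name, the category labels, and the example entries are all hypothetical.

```python
# Illustrative sketch only: a greedy longest-match gazetteer look-up,
# not the implementation used in Liner2.


def dictionary_chunk(tokens, gazetteer, max_len=3):
    """Return (start, end, category) spans, end exclusive, preferring longer matches."""
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length])
            if candidate in gazetteer:
                spans.append((i, i + length, gazetteer[candidate]))
                i += length
                break
        else:
            i += 1  # no entry starts at this token
    return spans


# Hypothetical gazetteer mapping surface forms to categories.
GAZETTEER = {
    "Jan": "first_name",
    "Kowalski": "surname",
    "Nowy Sącz": "city_name",
    "Polska": "country_name",
}

TOKENS = "Jan Kowalski odwiedził Nowy Sącz".split()
SPANS = dictionary_chunk(TOKENS, GAZETTEER)
```

Here `SPANS` marks "Jan" as a first name, "Kowalski" as a surname, and the two-token sequence "Nowy Sącz" as a city name. For an inflected language such as Polish, exact string matching misses most inflected forms, which is one reason a framework would combine look-up with statistical models.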
