A breakthrough in the field of natural language processing (NLP) has been achieved with the Cortical.io Retina engine, which elegantly solves numerous text-data processing problems faced by big businesses. The Retina engine surmounts the obstacles in dealing with terabytes of unstructured text data, thanks to a unique algorithm that is based on neuroscientific research into how the human brain processes information. The technology can analyze the meaning of not just keywords, but of whole sentences, paragraphs, and long texts. It can also be applied to documents in different languages. By focusing on meaning, the challenges of language ambiguity and vocabulary mismatch are overcome. For example, the phrases “we closed the deal” and “the contract was signed” have similar meanings but use completely different words; the Cortical.io Retina engine recognizes that similarity.
AI-based contract intelligence
Major global companies are leveraging a wealth of data from contracts and other legal documents by deploying Cortical.io Contract Intelligence, a highly accurate data-extraction solution that combines the Retina engine with various NLP techniques. These companies use Cortical.io technology to automate the precise extraction of key information from thousands of complex contracts that contain disparate and diverse language. Automation has freed up significant manual resources, greatly reduced costs, and corrected human error in the extraction of information. Firms are able to quickly generate consistent and comparable summary abstracts and spreadsheets, manage contract lifecycles in real time, and gain valuable insights into the financial situation of potential clients. Financial institutions are lowering credit risk by identifying clauses associated with performing and non-performing contracts, and large firms are meeting new legal requirements by being able to quickly search agreements for the correct figures to be added to the balance sheet.
A mix of unsupervised machine learning and expert feedback
Cortical.io Contract Intelligence is a standalone system whose input is information from data sources that contain contracts and other documents. The engine processes this unstructured or semi-structured information and returns structured key information that is easily used in further business information analysis processes.
Data can be extracted from new contract types based on how the requested information is written in only three to ten sample contracts. The technology amplifies company intelligence and increases accuracy through a unique combination of unsupervised machine learning and an iterative fine-tuning process involving the companies’ subject matter experts (SMEs).
SMEs typically go through five to ten simple iterations, interacting with the system to produce the best results.
- SMEs define the type of information that they want to extract.
- In an unsupervised learning phase, the system learns to recognize contract vocabulary and concepts (for example, facilities, loans, variations of dates, and parties to an agreement) and forms relationships among the concepts.
- Information is extracted from the contracts.
- Based on the results of the extraction, SMEs fine-tune the system by adding to or modifying the type of information requested.
Using the brain as a model for artificial intelligence
Cortical.io Contract Intelligence makes use of the Retina engine, which was developed from the Semantic Folding theory of how the human brain works. The brain’s neocortex is a 2D sheet of neuron assemblies that process information such as text, images and sounds. A mathematical model called Sparse Distributed Representation simulates how the neocortex stores this information. In this model, each piece of information is represented by a long binary vector that has many “zeros” (inactive bits) and relatively few “ones” (active bits). Each active bit contains some part of the information’s meaning. If the same bit is active in two vectors, the two pieces of information that the vectors represent are, in at least one aspect, similar in meaning. The greater the number of active bits in common, the more similar the two pieces of information.
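The overlap idea can be illustrated with a minimal Python sketch. It is not the Retina engine's implementation: the function names, the bit positions, and the choice to normalize by the smaller vector are all assumptions made for illustration. Each sparse vector is stored as the set of indices of its active bits.

```python
from typing import Set


def overlap(a: Set[int], b: Set[int]) -> int:
    """Count the active bits that two sparse binary vectors share."""
    return len(a & b)


def overlap_similarity(a: Set[int], b: Set[int]) -> float:
    """Shared active bits divided by the size of the smaller vector.

    This normalization is an assumption; the text only says that more
    shared active bits means more similar meaning.
    """
    if not a or not b:
        return 0.0
    return overlap(a, b) / min(len(a), len(b))


# Invented bit positions, for illustration only.
deal = {12, 87, 301, 4502, 9001}
contract = {12, 87, 301, 777, 15000}
weather = {5, 640, 2048, 9001, 16000}

print(overlap(deal, contract))  # 3 shared active bits
print(overlap(deal, weather))   # 1 shared active bit
```

Storing only the indices of active bits keeps the comparison cheap, because the vectors are mostly zeros.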
Collecting text to form a semantic space
To create an efficient, elegant system for natural language processing, Cortical.io technology gathers text from selected reference literature. The text is cut into meaning-based slices, called snippets, which are then distributed over a 2D grid. Snippets that have similar meanings are placed close to one another. The 2D grid is called a semantic space, and, in this space, each snippet has a pair of coordinates.
Representing the meaning of words numerically
To represent the meaning of any word numerically, the Cortical.io Retina engine activates the grid positions of all snippets that contain the word. The resulting grid representation of the meaning of the word is known as the word’s semantic fingerprint.
Example of a semantic fingerprint of a word
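As a rough sketch of this step, the Python below builds a fingerprint from a toy semantic space. The snippets, their coordinates, and the name `word_fingerprint` are hypothetical. The sketch assumes the snippets have already been placed on the grid and uses simple whitespace tokenization.

```python
from typing import Dict, Set, Tuple

Position = Tuple[int, int]  # (row, column) in the 2D semantic space

# Toy semantic space (invented): each snippet already has a grid position,
# with similar snippets placed near each other.
SNIPPETS: Dict[Position, str] = {
    (0, 0): "the bank approved the loan facility",
    (0, 1): "the borrower repaid the loan early",
    (3, 2): "the river bank flooded after the storm",
    (3, 3): "heavy rain caused the river to rise",
}


def word_fingerprint(word: str, snippets: Dict[Position, str]) -> Set[Position]:
    """Activate the grid position of every snippet that contains the word."""
    target = word.lower()
    return {
        position
        for position, text in snippets.items()
        if target in text.lower().split()
    }


print(sorted(word_fingerprint("loan", SNIPPETS)))  # [(0, 0), (0, 1)]
print(sorted(word_fingerprint("bank", SNIPPETS)))  # [(0, 0), (3, 2)]
```

In this toy space, "bank" activates positions in two separate regions of the grid, one for each of its senses.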
The grid can be unfurled to form a long binary vector, where each active position on the grid corresponds to an active bit in the vector. The binary vector is then the numerical representation of the word’s meaning and can be used for comparison and computation—operations that are crucial to the efficient application of NLP. To convert a longer piece of text to a semantic fingerprint, the system first converts each word of the text to an individual fingerprint and then combines the fingerprints.
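The Python sketch below illustrates both steps under stated assumptions. The text does not say how the grid is unfurled or how fingerprints are combined, so the row-major ordering and the count-and-threshold rule are illustrative guesses. The names `unfurl` and `combine` and the 4 x 4 grid are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Position = Tuple[int, int]  # (row, column) on the grid


def unfurl(fingerprint: Set[Position], height: int, width: int) -> List[int]:
    """Flatten a grid fingerprint into a binary vector (row-major order assumed)."""
    vector = [0] * (height * width)
    for row, col in fingerprint:
        vector[row * width + col] = 1
    return vector


def combine(fingerprints: Iterable[Set[Position]], min_count: int = 1) -> Set[Position]:
    """Merge word fingerprints into one text fingerprint.

    Assumed rule: keep positions active in at least `min_count` word
    fingerprints. With min_count=1 this is a plain union.
    """
    counts = Counter(pos for fp in fingerprints for pos in fp)
    return {pos for pos, n in counts.items() if n >= min_count}


# Invented word fingerprints on a hypothetical 4 x 4 grid.
loan = {(0, 0), (0, 1)}
bank = {(0, 0), (3, 2)}

text_fingerprint = combine([loan, bank])
print(sorted(text_fingerprint))        # [(0, 0), (0, 1), (3, 2)]
print(unfurl(text_fingerprint, 4, 4))  # 16 bits, 3 of them active
```

Raising `min_count` keeps only the positions that several words share, which is one possible way to keep a long text's fingerprint sparse.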
Find out more at the Data Natives workshop and presentation
At the Data Natives conference in Berlin, 15-17 November 2017, Cortical.io co-founder and General Manager, Francisco Webber, will explain how Cortical.io Contract Intelligence technology works and hold a workshop on extracting information from sample contracts.
For more information about Cortical.io and Semantic Folding theory, you can also see the video about Cortical.io Retina technology.
Interesting, creative, and elegant thinking to address the challenging problem of text-based unstructured data. Particularly interesting is the convergence of AI into the lower-level processing and representation. I have looked at something similar in my research. Regards, Dr Kulvinder Panesar