Document QA With Limited Semantics

A large organisation will produce many documents over the course of a year, but most organisations spend almost no time assessing their quality. Some of the documents may form the basis of projects worth hundreds of millions or billions of dollars, or set in place contractual arrangements that last for decades, or control safety-critical responses, making the knowledge within the documents highly valuable.

In other areas, Quality Assurance (QA) is an essential part of the process:

Eggs are inspected for quality, washed bottles are checked for cleanliness, critical welds are routinely radiographed, and the composition of metal alloys is checked in a laboratory as they are produced. In the case of welds, the weldment may be subjected to a stress test. The document equivalent would be testing it with a skewed sample of the target audience: people who are likely to be affected by the document and are willing to voice their opinions. Virtually the only time this happens now is with advertisements, yet other documents within the organisation can be vastly more valuable, and errors in them can lead to losses in the billions or long-term confusion in the workforce.

Why do we do so little with documents, with nothing past a simple spell check? Is it because we have faith in the ability of people to understand what the documents mean, or is it because we sheet home the blame for failure to the person who wrote or read the document, rather than the document itself? Or is it that we thought nothing could be done? The arrival of computers with gigabyte memories running at gigahertz speeds has changed the calculus.

Interactive Engineering has been working on a system to read specifications and other high-value documents and extract the complete semantic structure. To be fully effective in reading, the system needs deep knowledge of the domain in which it is working, making it unsuitable for documents that lack such high intrinsic value. However, part of the system is a lexical and simple structural analysis tool, which does not rely on deep domain knowledge. This tool can be used to run QA on documents which are important but do not warrant full semantic structure analysis. It checks structural integrity using many different types of check: a referenced item really is there, an indexed list isn’t messed up, defined terms are used coherently, acronyms are either in a glossary or in the organisation’s database, dimensional units are valid. It runs every check that is possible without reading the document in depth. The result of these checks can be used as an indicator of the quality of the document.
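To make the kind of check described above concrete, here is a minimal sketch in Python of three of them: a referenced section really exists, acronyms appear in a glossary, and a numbered list does not skip or repeat. It is an illustration only, not the Interactive Engineering tool. The function names, the heading format ("2.1 Title"), the list format ("(1) ...") and the sample text are all assumptions made for the example.

```python
import re

def check_cross_references(text):
    """Return section numbers that are referenced but never appear as a heading.

    Assumption: headings look like "2.1 Title" at the start of a line,
    and references look like "Section 2.1".
    """
    defined = set(re.findall(r"^(\d+(?:\.\d+)*)\s+\S", text, flags=re.MULTILINE))
    referenced = set(re.findall(r"\b[Ss]ection\s+(\d+(?:\.\d+)*)", text))
    return sorted(referenced - defined)

def check_acronyms(text, glossary):
    """Return acronyms (runs of two or more capitals) missing from the glossary."""
    found = set(re.findall(r"\b[A-Z]{2,}\b", text))
    return sorted(found - set(glossary))

def check_list_numbering(text):
    """Return (expected, found) pairs where a "(n)" list skips or repeats a number."""
    problems = []
    expected = None
    for line in text.splitlines():
        match = re.match(r"\s*\((\d+)\)\s", line)
        if not match:
            expected = None  # any other line ends the current list
            continue
        number = int(match.group(1))
        if expected is not None and number != expected:
            problems.append((expected, number))
        expected = number + 1
    return problems

def run_checks(text, glossary):
    """Run every check and collect the findings under one name per check."""
    return {
        "missing_sections": check_cross_references(text),
        "unknown_acronyms": check_acronyms(text, glossary),
        "list_numbering": check_list_numbering(text),
    }

# Hypothetical sample document, invented for this example.
SAMPLE = """1 Scope
The PLC shall comply with Section 2.1 and Section 4.
2 Requirements
2.1 Power
(1) The UPS shall supply the PLC.
(2) The supply shall be monitored.
(4) Alarms shall be raised on failure.
"""

if __name__ == "__main__":
    for name, findings in run_checks(SAMPLE, glossary={"PLC"}).items():
        print(name, findings)
```

None of these checks needs to understand what the document says. Each one only compares one part of the text against another part, or against a list held by the organisation, which is why they can be run cheaply on documents that do not justify deeper analysis.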

We are suggesting that a small group of people be set up in a large organisation, with the purpose of providing QA on the documents being produced in or introduced into the organisation. Their job is to act as a filter and stop poor-quality material getting into the system. The logical place for this filtering is just before the document is placed in the Document Management System (DMS), but checks of documents in draft form can also be made. The people in the group become familiar with all the machinery of knowledge found in documents – the way that large documents are structured, with defined terms either in a block at the start of the document or scattered through the document, a glossary, a hierarchical structure of sections, control of existence from one section to another.

As with other forms of QA, sampling needs to be used to limit the workload. If documents from a source are regularly clean, then sampling can be infrequent, unless the source begins producing documents outside its norm. If documents from a source are regularly flagged with errors, then attention is paid there until the source is clean, or until it is decided that the documents are of limited value and the problems resulting from their poor quality are of little consequence.
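The sampling policy above could be sketched roughly as follows. This is one possible reading, not a prescribed scheme. The function names, the 10% base rate, the 5% threshold for calling a source clean, and the choice to check everything from a source with no history are all assumptions made for illustration.

```python
import random

def sampling_rate(recent_flags, base_rate=0.1, clean_threshold=0.05):
    """Return the fraction of a source's documents to check.

    recent_flags holds one boolean per recently checked document from the
    source: True if the document was flagged with errors.
    """
    if not recent_flags:
        return 1.0  # assumption: no history yet, so check everything
    flagged_fraction = sum(recent_flags) / len(recent_flags)
    if flagged_fraction <= clean_threshold:
        return base_rate  # regularly clean source: sample infrequently
    return 1.0  # regularly flagged source: check every document until it is clean

def should_check(recent_flags, rng=random):
    """Decide at random whether the next document from this source is checked."""
    return rng.random() < sampling_rate(recent_flags)

if __name__ == "__main__":
    clean_source = [False] * 20
    noisy_source = [True, False, True, False, False]
    print(sampling_rate(clean_source))
    print(sampling_rate(noisy_source))
```

Keeping only recent results in `recent_flags` means a source that starts producing documents outside its norm moves back to full checking quickly, and a source that cleans up returns to light sampling once its flagged documents drop out of the window.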

"We have our lawyers check important documents" – lawyers are looking for quite different things, and are easily overwhelmed by large documents – there is too much to hold in the head at once, and they are probably not familiar with the technicalities of what the document is describing.

Introducing Document QA into an organisation is a good way of getting some cleanup while determining whether the next step – full semantic structure extraction and analysis – is worthwhile.
