Keywords are short sequences of words that appear more commonly in a particular document than would otherwise be expected in the English language. Keywords summarize the unique phrases and language used in the document.
In a nutshell
We compare the text of the IssueLab document sent to the API against the text of the entire corpus of English Wikipedia. We look for phrases that appear more commonly in the IssueLab document than in Wikipedia as a whole, but that are not so common that they are uninformative.
To Start: Text Extraction
First, we extract all the text of a document, keeping track of all single words, 2-word combinations (2-grams), and 3-word combinations (3-grams). These are all the possible keywords and key phrases that are associated with this document.
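The extraction step above can be sketched as follows. This is a minimal illustration, not the API's actual code: the function name `extract_ngrams` and the tokenization rule (lowercase, split on anything other than letters, digits, and apostrophes) are assumptions, and for simplicity this sketch lets n-grams span sentence boundaries.

```python
import re
from collections import Counter


def extract_ngrams(text, max_n=3):
    """Count every 1-, 2-, and 3-word sequence in `text`."""
    # Assumed tokenization: lowercase, keep runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the document.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


counts = extract_ngrams("Clean water saves lives. Clean water matters.")
print(counts.most_common(3))
```

Every key in `counts` is a candidate keyword or key phrase, and its value is how often it appears in the document.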
Wikipedia as Proxy for English
Next, we compare the extracted text of the document to a different set of extracted text: Wikipedia. We use the entire downloaded corpus, the set of all English Wikipedia articles from wikipedia.org. Of this giant corpus, we keep track of all single words, 2-grams, and 3-grams, along with their frequencies. This is our proxy for relative word and phrase frequencies of the English language.
Then, we compute a score for each word and phrase in the IssueLab document that captures the "relevancy" of the word/phrase to the IssueLab document: the higher the score, the more relevant the word or phrase is to the document. The calculation of the score takes into account three factors: a count of how often this word/phrase appears in the IssueLab document, a count of how often this word/phrase appears in all of Wikipedia, and the number of unique Wikipedia documents in which this word/phrase appears.
The more often the word/phrase appears in the IssueLab document, the higher its score. The more often it appears in Wikipedia, the lower its score. The higher the number of unique documents in which this word/phrase appears, the lower its score.
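The exact formula is not given here, so the following is only one possible shape for such a score, written to show the three monotonic relationships described above. The function name, the parameter names, and the specific logarithmic terms are all assumptions, and the output is not calibrated to the API's actual score scale.

```python
import math


def relevancy_score(doc_count, wiki_count, wiki_doc_count, wiki_total_docs):
    """Illustrative score only; NOT the API's actual formula.

    doc_count       -- occurrences in the IssueLab document
    wiki_count      -- occurrences across all of Wikipedia
    wiki_doc_count  -- number of Wikipedia articles containing the phrase
    wiki_total_docs -- total number of Wikipedia articles
    """
    # Rises with the number of occurrences in the document.
    term_frequency = math.log1p(doc_count)
    # Falls as the phrase becomes more frequent in Wikipedia overall.
    rarity = 1.0 / math.log(math.e + wiki_count)
    # Falls as the phrase appears in more distinct Wikipedia articles.
    inverse_doc_frequency = math.log((1 + wiki_total_docs) / (1 + wiki_doc_count))
    return term_frequency * rarity * inverse_doc_frequency


print(relevancy_score(10, 100, 50, 1000) > relevancy_score(10, 1000, 500, 1000))
```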
An Approximate Rubric
Roughly speaking, the score corresponds to:
- 0.00 - 0.25: not descriptive
- 0.25 - 0.50: marginally descriptive
- 0.50 - 1.00: descriptive
- 1.00 and above: very descriptive
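A consumer of the API could turn a score into one of these labels with a small helper like the one below. The function name is hypothetical, and because the ranges above share their endpoints, assigning a boundary value such as 0.25 to the higher band is an assumption.

```python
def describe_score(score):
    """Map a keyword score to the approximate rubric label."""
    # Assumption: a value exactly on a boundary belongs to the higher band.
    if score >= 1.00:
        return "very descriptive"
    if score >= 0.50:
        return "descriptive"
    if score >= 0.25:
        return "marginally descriptive"
    return "not descriptive"


for value in (0.1, 0.3, 0.75, 1.4):
    print(value, describe_score(value))
```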
Key findings are included in many grey literature documents to highlight the most important content. Key findings are extracted directly from the documents when they exist.
Text and metadata extraction
First, we process the PDF document to extract as much information as possible. This means that for every letter or character in the document, we extract the font-type, font-size, and absolute position on the page.
Resynthesis of the document's structure
From this extracted data, we determine which characters form words (and which areas are spaces between words), which groups of words form paragraphs or columns, which groups of words form titles, and which groups of words form headers, footers, and page numbers. Then we can deduce the sections of the text.
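As a rough illustration of the first of these steps, the sketch below groups the characters of a single line into words by looking at the horizontal gaps between them. The `Char` record, the `group_into_words` function, and the gap threshold of a quarter of the font size are all assumptions made for this example; the actual heuristics are not described here.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x: float          # left edge of the character on the page
    width: float
    font_size: float


def group_into_words(chars, gap_ratio=0.25):
    """Split one line of characters into words wherever the gap is wide."""
    words, current, previous = [], [], None
    for char in sorted(chars, key=lambda c: c.x):
        if previous is not None:
            gap = char.x - (previous.x + previous.width)
            # Assumed rule: a gap wider than a fraction of the font size is a space.
            if gap > gap_ratio * previous.font_size:
                words.append("".join(c.text for c in current))
                current = []
        current.append(char)
        previous = char
    if current:
        words.append("".join(c.text for c in current))
    return words


line = [Char("K", 0, 5, 10), Char("e", 5, 5, 10), Char("y", 10, 5, 10),
        Char("f", 20, 5, 10), Char("i", 25, 5, 10)]
print(group_into_words(line))
```

The same idea, applied to vertical gaps and font sizes, could separate paragraphs from titles, but that is beyond this sketch.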
Isolate the Relevant Sections
Now that we have the entire document organized into sections of body text and headings, we look for two classes of headings. In both classes of headings, the associated sections are more likely to include key findings.
The first class includes phrases like "Key Findings," "Summary of Findings," and "Complete Research Findings." For example, if the heading/title of a section is "Key Findings," then that section's body text will almost certainly contain the text we want to extract.
The second class includes phrases like "Introduction," "Executive Summary," and "Overview." While sections with these headings might have what we're looking for, most of these sections will also contain other text that is not key findings.
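The two classes of headings could be told apart with a simple phrase match, as in the sketch below. Only the six phrases named above come from this document; the function name, the substring matching, and the rule that the first class wins when both match are assumptions, and the real phrase lists are presumably longer.

```python
FIRST_CLASS = ("key findings", "summary of findings", "complete research findings")
SECOND_CLASS = ("introduction", "executive summary", "overview")


def classify_heading(heading):
    """Return 1 or 2 for the heading's class, or None if it matches neither."""
    # Normalize case and collapse repeated whitespace before matching.
    normalized = " ".join(heading.lower().split())
    if any(phrase in normalized for phrase in FIRST_CLASS):
        return 1
    if any(phrase in normalized for phrase in SECOND_CLASS):
        return 2
    return None


for heading in ("Key Findings", "EXECUTIVE  SUMMARY", "Methodology"):
    print(heading, classify_heading(heading))
```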
Catch the Bullets
In all of the above sections of interest, we search for body text that looks like bullets. The bullets can be standard circular bullets, squares, repeated letters in symbolic fonts (like Wingdings), or numbered lists.
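A minimal sketch of such a bullet test is shown below. The symbol set, the numbered-list pattern, the list of symbolic font names, and the `looks_like_bullet` function itself are assumptions for illustration; it relies on the font name captured during text extraction to spot letters that render as bullet glyphs.

```python
import re

BULLET_SYMBOLS = "•◦●○▪■□"
# Matches "1. ", "2) ", or "(3) " at the start of a line.
NUMBERED = re.compile(r"^\(?\d{1,2}[.)]\s+")
SYMBOLIC_FONTS = ("wingdings", "symbol", "zapfdingbats")


def looks_like_bullet(line, font_name=""):
    """Guess whether a line of body text starts a bulleted or numbered item."""
    stripped = line.lstrip()
    if not stripped:
        return False
    if stripped[0] in BULLET_SYMBOLS:
        return True
    if NUMBERED.match(stripped):
        return True
    # A lone letter set in a symbolic font usually renders as a bullet glyph.
    first_token = stripped.split()[0]
    return len(first_token) == 1 and font_name.lower().startswith(SYMBOLIC_FONTS)


print(looks_like_bullet("• Water access improved"))
print(looks_like_bullet("n Water access improved", font_name="Wingdings"))
```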
No Bullets in First Class
First, we look for bullets in any sections with headings belonging to the first class.
If we don't find bullets, i.e. if we find that the document has a section called something like "Key Findings" but we don't find bullets in that section, then we return the heading, the page number of the heading, and categorize this type of key finding as "no_bullets." This categorization means that the document has a "Key Findings" section that is likely human readable, but the API didn't extract any key findings.
No Bullets in Second Class
If we didn't find bullets in the first class headings, then we continue looking for bullets in the second class headings.
If we don't find bullets in any sections with headings belonging to the second class, i.e. if we find that the document has an "Introduction" or "Executive Summary" section but we don't find bullets in those sections, then we don't return any key findings for that document.
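The decision flow for the two classes can be summarized in code. This sketch assumes each section has already been reduced to a dictionary with `heading`, `page`, `heading_class`, and `bullets` keys, which are hypothetical names. It also assumes that a "no_bullets" entry is reported alongside anything later found in second-class sections, which is one reading of the description above.

```python
def collect_findings(sections):
    """Apply the first-class / second-class search order to parsed sections."""
    def entry(section, category=None):
        return {
            "heading": section["heading"],
            "page": section["page"],
            "category": category,
            "findings": list(section["bullets"]),
        }

    results = []
    found_in_first_class = False
    for section in sections:
        if section["heading_class"] != 1:
            continue
        if section["bullets"]:
            found_in_first_class = True
            results.append(entry(section))
        else:
            # Heading exists but no bullets: report it so a human can look.
            results.append(entry(section, "no_bullets"))

    # Only fall back to second-class headings when the first class had no bullets.
    if not found_in_first_class:
        for section in sections:
            if section["heading_class"] == 2 and section["bullets"]:
                results.append(entry(section))
    return results


example = [
    {"heading": "Key Findings", "page": 3, "heading_class": 1, "bullets": []},
    {"heading": "Introduction", "page": 1, "heading_class": 2, "bullets": ["A", "B"]},
]
print(collect_findings(example))
```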
Catch and Clean the Text in Bullets
If we find bullets, then we extract the text that is associated with these bullets, and "clean" the text as much as possible: this includes combining sections of text that should be a single key finding, separating sections when there should be multiple key findings, and removing any remaining text that should not be in the final returned list of key findings.
Final Quality Filter
We continue to move this text through various quality filters.
For some key findings, we look for contextual information to see if we're truly extracting key findings, or just a bulleted list somewhere in the document. If we suspect that the content of the extracted list might not be precisely what we're looking for, we categorize these as "generic." We also pull out lists of Recommendations, and categorize them separately (as "recommendation"), so that they are not confused with true key findings. We also check that the key findings are correctly formatted; if we detect formatting errors, we output the key findings with the category "format_error," so that a human reader might be able to edit this output. Finally, if we only extract a single key finding within a document section, we categorize it as "singleton," thereby flagging it because we're likely either missing a few key findings, or we're incorrectly extracting something that's not relevant.
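The four category names above come from this description, but the tests behind them are not spelled out. The sketch below fills them in with deliberately simple stand-ins: the function name, the `has_findings_context` flag, the lowercase-first-letter check for formatting errors, and the order in which the checks run are all assumptions.

```python
def categorize(heading, findings, has_findings_context=True):
    """Return a quality category for an extracted list, or None if unflagged."""
    if "recommendation" in heading.lower():
        return "recommendation"
    if len(findings) == 1:
        return "singleton"
    # Stand-in formatting check: an empty item, or one starting mid-sentence.
    if any(not f.strip() or f.strip()[0].islower() for f in findings):
        return "format_error"
    # Stand-in context check: the caller decides whether the surrounding
    # text suggests these bullets really are key findings.
    if not has_findings_context:
        return "generic"
    return None


print(categorize("Recommendations", ["Fund rural clinics.", "Train more nurses."]))
print(categorize("Key Findings", ["Only one finding was extracted."]))
```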
Wikitopics provide a succinct way of summarizing IssueLab documents based on their similarity with content in Wikipedia. You can think of them as "tags" or "topics" that are associated with each IssueLab document. Since these topics are generated from the Wikipedia corpus, we call them "Wikitopics".
A tale of two Wikitopics
The API generates two different types of Wikitopics: micro-level Wikitopics that are based on Wikipedia articles and macro-level Wikitopics that are based on Wikipedia categories.
The basis for Article Wikitopics is the set of individual articles in Wikipedia. Since the set of possible topics consists of all 4.5 million Wikipedia articles, Article Wikitopics are an extremely nuanced way of summarizing an IssueLab document, albeit at a higher level than keywords. Article Wikitopics consist of things like Water supply and sanitation in Ethiopia or Health in Indonesia.
Since each Wikipedia article belongs to one or more of the ~400 Wikipedia categories, we can summarize IssueLab documents in a more general way. Category Wikitopics consist of things like Oral hygiene or Physical exercise.
Dealing with an evolving narrative
The challenge with IssueLab documents is that they vary significantly in length and, as a consequence, the topic tends to change over the course of any given document. For example, a document that starts by discussing the societal issues with water sanitation may transition to discussing legal implications, economic incentives, and finally the underlying science behind improved sanitation. To address this issue, we break each document into roughly paragraph-length "chunks" before we compare it with Wikipedia.
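A very simple way to produce such chunks is to cut the text every fixed number of words, as sketched below. The function name and the default of 100 words per chunk are assumptions; the actual chunk size and whether real paragraph boundaries are respected are not stated here.

```python
def chunk_text(text, chunk_size=100):
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


print(chunk_text("water sanitation law economics science", chunk_size=2))
```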
Scoring Wikitopic relevance
We compare the text in each chunk of an IssueLab document against Wikipedia to find the most relevant Wikipedia pages. Technically, we calculate the "relevance" as the cosine similarity between the tf-idf vector of the words in that document chunk and the tf-idf vector of each Wikipedia article (tf-idf is a measure pioneered by Karen Sparck Jones, the namesake of this API). By taking a weighted average of the related Wikipedia articles across the document's chunks, we can efficiently and effectively capture the most relevant Article Wikitopics.
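The sketch below shows the general technique on toy data using only the standard library. The function names, the smoothed idf formula, and the choice to weight each chunk by its word count are assumptions, and for simplicity the idf statistics are computed over the toy articles and chunks together rather than over Wikipedia alone.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (a dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed idf so that no term gets a zero or negative weight.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_articles(chunks, articles):
    """Average each article's similarity over all chunks, weighted by chunk length."""
    names = list(articles)
    vectors = tfidf_vectors([articles[name] for name in names] + list(chunks))
    article_vecs, chunk_vecs = vectors[:len(names)], vectors[len(names):]
    weights = [len(chunk.split()) for chunk in chunks]
    total = sum(weights) or 1
    return {
        name: sum(w * cosine(cv, av) for w, cv in zip(weights, chunk_vecs)) / total
        for name, av in zip(names, article_vecs)
    }


# Toy stand-ins for Wikipedia article text.
articles = {
    "Water supply": "water supply sanitation pipes water",
    "Physical exercise": "exercise fitness health running",
}
chunks = ["clean water and sanitation", "water pipes"]
scores = rank_articles(chunks, articles)
print(max(scores, key=scores.get))
```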
Calculating the most relevant Category Wikitopics is done similarly; we use the association between Articles and Categories to compute a weighted average to find the most relevant Wikipedia Categories.
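One way to roll article scores up into category scores is sketched below. The function name, the toy article-to-category mapping, and the choice to score a category as the plain average of its articles' scores are assumptions; the actual weighting is not described here.

```python
from collections import defaultdict


def rank_categories(article_scores, article_categories):
    """Score each category as the mean score of the articles that belong to it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for article, score in article_scores.items():
        for category in article_categories.get(article, ()):
            totals[category] += score
            counts[category] += 1
    averages = {category: totals[category] / counts[category] for category in totals}
    # Highest-scoring categories first.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical article scores and category memberships.
article_scores = {"Toothbrush": 0.6, "Dental floss": 0.4, "Running": 0.1}
article_categories = {
    "Toothbrush": ["Oral hygiene"],
    "Dental floss": ["Oral hygiene"],
    "Running": ["Physical exercise"],
}
ranked = rank_categories(article_scores, article_categories)
print(ranked[0][0])
```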