Text Mining with iAnalyzer

iAnalyzer is a simple online tool for searching and querying text corpora.

In today’s exercises, we will use it to search the available text corpora and download the results as a CSV file, which you will analyze further in R in the next session.

Slides

Exercise:

Before starting, create a folder for this course and a subfolder for today’s session in an accessible location on your computer.

To perform a first search, follow these steps:

  1. Go to the iAnalyzer website: https://ianalyzer.hum.uu.nl/
  2. Log in with your Solis-ID
  3. Perform a search with these criteria:
     Corpus: Times (select from the ‘corpora’ dropdown menu at the top left)
     Query: ‘European Union’
     Publication Date: from 1945 onwards
     OCR confidence: 80-100 (the OCR score is explained at the bottom of the page)
     Page Type: Standard
     On Front Page: Yes (so leave the ‘false’ box unchecked)
     Category: News
     Illustration: No (so leave all boxes unchecked)
  4. Check out the different visualizations (at the top, right next to the header ‘Filters’)
  5. Download the data in CSV format (“download CSV” at the top of the page). Before downloading, click the small gears icon next to the download button, click “show default fields”, and check the box “content”. Otherwise, the downloaded file will not contain the actual text of the articles.
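
Before the next session, it can be worth checking that the downloaded file really contains the article text. The following is a minimal Python sketch of such a check (the course itself uses R for the analysis). The column name `content`, the comma delimiter, and the file name in the comment are assumptions; adjust them to match your own download.

```python
import csv
import io


def check_content_column(csv_file, column="content"):
    """Return (total_rows, rows_with_text) for the given column.

    Raises ValueError if the column is missing, which would suggest that
    the "content" box was not checked before downloading.
    """
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found; found {reader.fieldnames}")
    total = 0
    with_text = 0
    for row in reader:
        total += 1
        if (row[column] or "").strip():
            with_text += 1
    return total, with_text


# Tiny in-memory stand-in for a downloaded file (made-up rows, not real data).
sample = io.StringIO(
    "date,title,content\n"
    "1993-11-01,Example headline,Some article text\n"
    "1993-11-02,Another headline,\n"
)
print(check_content_column(sample))  # (2, 1)

# For a real download (hypothetical file name):
# with open("times_european_union.csv", newline="", encoding="utf-8") as f:
#     print(check_content_column(f))
```

If the function raises an error or reports zero rows with text, repeat the download with the “content” box checked.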

Cheatsheet for Search Queries

Operator    Description      Example/Explanation
+           AND              bank +assets
|           OR               bank |assets (OR is already the default way to combine search terms, so bank assets would be sufficient here)
-           NOT              bank -assets
" "         entire phrase    Searches for an exact phrase: "the assets of the bank"
*           wildcard         Matches any number of characters, e.g. bank* matches banking, banks, banked, etc. The wildcard is only allowed at the end of a word and cannot be used inside a quoted phrase.
~N          fuzzy search     After a term, N is the number of characters that may differ, so bank~1 also matches bang, sank, dank, etc. After a phrase, N is the number of words that may differ.
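
To get a feel for what the wildcard and fuzzy operators in the cheatsheet do, here is a small Python sketch that imitates them on a word list. It is only an illustration: the function names are made up, and it assumes that "characters allowed to differ" means plain edit distance (insertions, deletions, substitutions). The search engine behind iAnalyzer may count edits slightly differently, for example by treating a swap of two adjacent letters as a single edit.

```python
def matches_wildcard(term, word):
    """Imitate bank*: a trailing '*' matches any word starting with the prefix."""
    if not term.endswith("*"):
        return term == word
    return word.startswith(term[:-1])


def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def matches_fuzzy(term, max_edits, word):
    """Imitate term~N: the word may differ from the term by at most N edits."""
    return edit_distance(term, word) <= max_edits


words = ["bank", "banks", "banking", "bang", "sank", "dank", "bench"]
print([w for w in words if matches_wildcard("bank*", w)])
print([w for w in words if matches_fuzzy("bank", 1, w)])
```

Under these assumptions, the wildcard keeps only the words that start with bank, while the fuzzy match with N = 1 also keeps bang, sank and dank, but not banking or bench.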

Explanation OCR scores

OCR stands for “optical character recognition” and is a key tool for text mining. OCR uses machine learning to extract words and lines of text from scans and images, which can then be used for quantitative text analysis or natural language processing.

As you can imagine, the automated recognition of characters will not be perfect: it depends on the quality of the image or scan, on whether the text is printed or handwritten, and so on. The OCR score is a metric that quantifies the accuracy of the text extraction. In other words, the score reflects how likely it is that the extracted text contains errors.
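
The same 80-100 OCR filter used in the search can also be applied after downloading. The Python sketch below shows one way to do this on rows that have already been read from the CSV file. The column name `ocr` is an assumption, and the example rows are made up; check the header of your own file for the actual name.

```python
def filter_by_ocr(rows, minimum=80.0, maximum=100.0, column="ocr"):
    """Keep rows whose OCR score lies between minimum and maximum (inclusive).

    Rows with a missing or non-numeric score are skipped.
    """
    kept = []
    for row in rows:
        try:
            score = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue
        if minimum <= score <= maximum:
            kept.append(row)
    return kept


# Made-up rows for illustration, in the shape produced by csv.DictReader.
rows = [
    {"title": "A", "ocr": "92.5"},
    {"title": "B", "ocr": "61.0"},
    {"title": "C", "ocr": ""},
]
print([row["title"] for row in filter_by_ocr(rows)])  # ['A']
```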