Visual Analytics Primer Part 2: Analyzing Textual Data

This is part 2 in a series of write-ups based on common questions I get in TA hours for Visual Analytics at Smith College. Feel free to reach out to me and request future topics! This reading will be helpful for people who are trying to analyze written text. A note — code examples will be written in Python in this tutorial, but there are similar tools available in R.

File Encodings

The first step in analyzing a text is reading it into memory, and that is a bit more complicated for computers than it is for humans. Computers store everything as binary numbers, so to represent human-readable text they map each character to a specific number. These mappings are called character encodings, and they are standardized. Two you might be familiar with are ASCII and UTF-8. ASCII is one of the earliest character encodings; it represents the unaccented English letters, digits, and common punctuation using 7 bits per character (in practice, one byte each). Unicode is a superset of ASCII that contains pretty much any character you can think of, and UTF-8 is one of several ways of encoding Unicode as bytes.

To read a text file in Python, you’ll need to make sure that the correct character encoding is used. If you don’t specify one, Python’s open() falls back to a platform-dependent default, which is UTF-8 on most macOS and Linux systems but often a legacy encoding such as cp1252 on Windows, so it is safest to pass the encoding explicitly. If you’re running into error messages when reading in a file, it might be worth trying a different encoding, like Latin-1, and seeing if that works. To learn more, check out this article which delves into the history and decision making process behind character encodings.
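As a rough illustration of trying more than one encoding, here is a minimal sketch. The helper name read_text_with_fallback and the list of candidate encodings are my own assumptions for this example, not a standard recipe. Latin-1 goes last because it can decode any sequence of bytes, so it never raises an error, even when it produces the wrong characters.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used).

    Hypothetical helper for illustration only. Latin-1 maps every byte
    to some character, so it should come last: it never fails, but it
    can silently give you the wrong characters.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


# The same character is stored as different bytes in different encodings.
print("é".encode("utf-8"))    # b'\xc3\xa9' (two bytes)
print("é".encode("latin-1"))  # b'\xe9' (one byte)
```

If you already know the encoding, passing it directly, as in open(path, encoding="utf-8"), is simpler than guessing.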

PDF data can also be read into Python with the right library. A common one is PyPDF2 (its maintained successor is published as pypdf), which can extract the text from a PDF document for further analysis. Even .docx files can be read into Python with a library such as python-docx.

Finding Patterns with Regular Expressions

When working with a large corpus of text, you’ll often want to find every string in the corpus that matches a specific pattern: email addresses, dates, street addresses, phone numbers, and so on. One way to find strings like this is to use regular expressions. Regular expressions are a way of describing patterns in strings. They can’t describe every pattern you might want to look for. Palindromes of arbitrary length, for example, can’t be matched by a regular expression (take CSC 250 to find out why). Even with these restrictions, they are still very powerful. R and Python both have tools for searching a text for all the strings that fit a given regular expression.
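To make this concrete, here is a small sketch using Python’s built-in re module. The sample sentence and the date format (year-month-day, written with dashes) are assumptions I made up for the example; real corpora usually need a more forgiving pattern.

```python
import re

text = "Filed on 2019-03-14, amended 2019-04-02, and heard on 14 March."

# Four digits, a dash, two digits, a dash, two digits.
# \d means "any digit" and {4} means "repeated exactly four times".
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

dates = date_pattern.findall(text)
print(dates)  # ['2019-03-14', '2019-04-02']
```

Notice that "14 March" is not found, because it does not fit the pattern. A regular expression only matches what you describe.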

For example, let’s say that I’m searching for all street addresses from 00–99 Sesame Street or 00–99 Melody Lane. I can create a regular expression which says “match all strings where the first character is 0–9, the second character is 0–9, and the remaining characters are Melody Lane OR Sesame Street.” Notation for regular expressions varies a little between languages, so I’m going to use Python’s regular expression syntax.

[0123456789][0123456789]( Melody Lane| Sesame Street)

The above regular expression will match all instances of that string pattern. First, it checks that the first character is in the set {0…9} (I chose to write out all the digits as an example, but there are briefer ways to describe the set, such as [0-9]). Then, it checks the same thing for the second character. After that, it makes sure that the following characters are either “ Melody Lane” OR “ Sesame Street”, and if all those conditions are satisfied, it matches! There are ways to design more sophisticated regular expressions than this one as well. With a little bit of time, someone could even design one which finds all the dates in the newspaper articles for the first Data Challenge.
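Here is a sketch of how the address pattern could be run in Python. The sample text is invented for the example. One detail worth knowing: when a pattern contains parentheses, re.findall returns only the part inside them, so the shorter version below uses (?: ... ), which groups the alternatives without capturing them.

```python
import re

text = (
    "Big Bird lives at 12 Sesame Street, Elmo moved to 07 Melody Lane, "
    "and nobody lives at 5 Sesame Street or 33 Elm Street."
)

# The pattern from the text, written out in full.
long_form = r"[0123456789][0123456789]( Melody Lane| Sesame Street)"

# The same pattern, with [0-9]{2} meaning "a digit, twice" and a
# non-capturing group so that findall returns the whole match.
short_form = r"[0-9]{2}(?: Melody Lane| Sesame Street)"

print(re.findall(short_form, text))
# ['12 Sesame Street', '07 Melody Lane']

# With the original pattern, finditer gives access to the whole match.
print([match.group(0) for match in re.finditer(long_form, text)])
# ['12 Sesame Street', '07 Melody Lane']
```

"5 Sesame Street" is skipped because it has only one digit, and "33 Elm Street" is skipped because the street name is not one of the two alternatives.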

There are a lot of websites out there for testing regular expressions to make sure that they work and are efficient. I personally like this one!

Keyword Extraction

If you’re interested in taking your analysis up a notch, keyword extraction might just be the way to go. Keyword extraction is useful when you’re trying to find the central words and ideas in a given text. Personally, I like to use the TextRank implementation in the summa library to find keywords. R has a similar package called textrank. TextRank works on the same principle as PageRank, the early algorithm that Google used to rank webpages. Google treated each webpage as a node and each link from one page to another as a directed edge, so a page scores highly when other high-scoring pages link to it. TextRank does the same thing with text: it turns the text into a graph in which words are nodes and edges connect related words (for keywords, typically words that appear near each other), and then it scores every word with the PageRank calculation. The highest-scoring words are the keywords. This is a little more than counting edges, since a connection from an important word is worth more than a connection from an unimportant one. For those of you that have taken CSC 212, the idea of graphs should be familiar. If you haven’t, I highly recommend this Wikipedia article on PageRank and this Britannica article on graph theory. You can also check out my primer on graph databases here!
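To show the idea without installing anything, here is a simplified sketch of a TextRank-style keyword ranker using only the standard library. It is not summa’s implementation: the function name, the tiny stopword list, the window size, and the fixed number of iterations are all assumptions I chose to keep the example short. Real libraries also filter by part of speech and merge adjacent keywords into phrases.

```python
import re
from collections import defaultdict

# A tiny, hand-picked stopword list (an assumption for this example only).
STOPWORDS = {
    "a", "an", "and", "are", "as", "by", "for", "in", "is", "it",
    "of", "on", "that", "the", "to", "with",
}


def textrank_keywords(text, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank words with a simplified TextRank (hypothetical helper)."""
    words = [
        w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS
    ]

    # Undirected co-occurrence graph: link words that appear within
    # `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # PageRank-style update: each word shares its score equally among
    # its neighbors, so well-connected words accumulate higher scores.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


sample = (
    "Regular expressions describe patterns in text. Keyword extraction "
    "finds central words in text, and graphs of words help rank the text."
)
print(textrank_keywords(sample))
```

For real projects, a maintained library will handle details this sketch ignores, but the core loop is the same: build a graph, then let scores flow along the edges until they settle.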

Using NLTK for NLP

NLTK, the Natural Language Toolkit, is a Python library originally created by Steven Bird and Edward Loper at the University of Pennsylvania. In this context, a natural language is one used by humans to communicate, as opposed to a formal language like a programming language, which follows a strict logical syntax. NLP is Natural Language Processing: using a computer program to analyze a given natural language text. NLTK can be used for a variety of tasks, including figuring out the part of speech of a word in a sentence, splitting sentences into words (tokenizing them), and identifying the relationships between entities in a given text. NLTK can also help prepare text to be turned into vector representations (points in multidimensional space), which makes it much easier to feed into machine learning algorithms.

This book is the most comprehensive guide for NLTK available for free online! In it, you’ll find information on downloading, configuring, and using NLTK in order to process a given corpus.

TL;DR

There are a lot of ways to analyze text, many more than I wrote about in this article. However, regular expressions, NLTK, and keyword extraction are all good ways to get started.

Also, I’d like to give a special thanks to Professor John Foley at Middlebury College for teaching me these skills and reading over this article!

Programmer and student at Smith College in Northampton, Massachusetts. Aspiring data scientist.