This is part 1 in a series of write-ups based on common questions I get in TA hours for Visual Analytics at Smith College. Feel free to reach out to me and request future topics! This reading will be helpful for people who are trying to structure their findings when doing data analysis.
How can you model data? (A.K.A. the quickest introduction to database theory possible)
Let’s say that you have millions of data points and you want to find some way to store them. How are you going to do it?
If there are common attributes about all of your data points, you can use that to your advantage. Say that you’ve been commissioned to help out the Northampton Public Library with designing their new book catalogue system. They know that each book has a title, an author, and a number in the library. While books might have repeated titles or authors, they all have unique numbers. We can call this number the “key” of the book, similar to the key in the back of an encyclopedia. It’s important that our keys are unique because we don’t want to store redundant information. If the library had 50 books and they asked you to draw out a way to organize the books, how might you do it?
You might start by creating a collection of cards sorted by key with all of the other book information written on the card. Then, if you know the key to the book that you’re looking for, you just have to get that card. Easy, right? This is the theory behind a “key-value database.” It’s also the theory behind a dictionary, where your key is the word you’re looking for, and your value is its definition. They’re a useful form of database, but they don’t represent the relationship between data points well. For example, Charles Dickins wrote multiple novels, but using the keys at our libraries, there’s no way for us to look for all of the books by Charles Dickens without checking all of our cards. Important to note, a key can also be known as an index!
This kind of relationship is called a “many to one” relationship because there are many objects (books) that relate to another object (authors). We can model it using a hierarchical model, which for those of you who have taken/are taking data structures, is a tree! One way to model this would be as follows — library would be the root node, Charles Dickins would be a child of the root, and his books would be the leaves.
Now, we run into a problem with this model too. Northampton Library has the book Good Omens, written by Terry Pratchett and Neil Gaiman. There’s two authors for it and it’s not like we want to make two versions of the same book, one for each author. We need to model a “many to many” relationship because many objects (books) can connect to many other objects (authors). There are ways to model this with a hierarchical model, but they’re not as elegant. This is where the relational model of databases comes into play.
Relational databases are similar to excel sheets in that they have a predetermined set of columns and then each data point is a row, with one key difference. Each sheet can be “joined” with other sheets by using the keys. They match each data point in one sheet (hereafter called a table) with the corresponding data points in other sheets by their keys and then can query the combined tables. This is the most standard database model in use today and it’s what is used in both pandas and R. A many to many relationship can easily be stored in a relational database by just making a table linking keys from a table of authors to keys from a table of books and joining on those keys. This is the optimal data model for Northampton Public Library.
Keys can be like our cards in the library, where there’s one specific value which is unique to each data point, or they can be “super keys” which are multiple values in a database that can be used to mark out a data point. My full name is three attributes (my first, middle, and last names) which are a unique super key for me. My social security number is a primary key because it’s one attribute of mine and (hopefully) no one else shares it! Through carefully designing the columns of your dataset with the keys in mind, you can also reduce redundancy. This is called normalization.
Relational algebra is the logic/mathematics of working with relational data. It is an extension of set theory. Dataframes in pandas and R are both based on the principles of relational algebra.
How can I work with relational data?
For those of you who have taken SDS 192, this will all be old hat. For those of you that haven’t, there are several basic logical actions you can take with a relational data set. You’re already familiar with joining data together. You can also project on a data set, which will only return information on the columns you ask for. For example, for my library data set I can project authors and get Charles Dickins, Neil Gaiman, and Terry Pratchett. You can filter your data, returning only data points which fulfill certain requirements. You can filter authors==”Charles Dickins” to get only books written by Charles Dickins in my data set. You can also
R, Python, SQL, and other tools may all use different words for the same actions, but the basic ideas are the same. Sometimes, they use the same words for different ideas. For example, projection is called select in R and pandas. Selection, as the term is used in relational algebra and database theory, is called filtering in R. Filtering in pandas can technically be selection or projection. I chose to stick with the terms used in relational algebra where possible.
How do I use it?
When you’re trying to build a data set, structure it so that you have a set of columns of distinct attributes that you want to find and then each data point is a row! Make sure that you decide ahead of starting to build your data set what your columns will be, otherwise your data will get messy. If you have to model a many to many relationship, use a junction table/associative array to store the relationships between data points. Be mindful of the keys that you use in your data to prevent redundancies in your data sets. This way, you can design data sets in R and pandas that follow the principles of relational databases and therefore are the most memory and space efficient that they can be.
Decide ahead of time what columns you want for your data sets and your data wrangling will be that much easier.