The Networked Catalog

By Matt Miller, NYPL Labs
July 31, 2014

The catalog could be considered the backbone of any working library. Performing an essential service, the catalog informs staff and patrons what materials are held and where in the library they can be found. As libraries have changed throughout history, the catalog, and how it has been used, has evolved as well. Some of the earliest catalogs still in existence, originating from medieval libraries, were simply lists written on the inside covers and margins of books in the collection. For these early libraries, containing a couple hundred volumes at most, this system was sufficient to keep track of the collection. However, during the Renaissance, as libraries grew in size, external catalogs started being produced which listed all the works held in the library. Such a catalog was primarily for the library staff to consult when a visitor was interested in a specific volume; a staff member would then go off and retrieve the item. It wasn't until Victorian Britain that the printed catalog was produced with patron use in mind, ushering in a new concept: the catalog as a tool for discovery. Of course catalogs continued to evolve, taking the form of the iconic card catalog in the late 19th century and then computerized systems in the late 20th century, but with the same mandate: organization and discovery.

At NYPL Labs, we are fascinated with our catalog and the possibilities its data represents. Just as the catalog has changed in the past, we wonder what other forms it could take today and in the future. With this driving thought we conducted a preliminary experiment: what if the catalog had a "See All" button? What if you could see everything at once, to get the big picture of what subjects the library has information on and how those topics relate to one another? Even imagining this conceptually is no easy task: unlike our medieval predecessors, The New York Public Library holds many millions of items. But the catalog comes to the rescue. Every item has been manually assigned subject headings, terms that describe what that resource is about. For example, here are the subject headings for Doris Kearns Goodwin's Team of Rivals:

Subject Headings for Team of Rivals

If we were to aggregate the subject headings for all of our resources, we would be able to see the big picture: which subjects the library has the most information about. This idea of knowledge mapping has been around for hundreds of years, usually emerging from the desire to map the history of ideas or time. Such maps have traditionally been conveyed visually as a temporal map or timeline, for example Joseph Priestley's A New Chart of History from 1769.

Joseph Priestley's A New Chart of History from 1769. Courtesy of Wikimedia Commons

Returning to our subject headings, we can notice another quality: not only do they categorize the item, they also indicate a relationship between the subjects. Because both headings are assigned to the same item, we can reasonably assume that the topic of Abraham Lincoln is related in some way to United States.

Possible subject heading relationships for Team of Rivals

Combining these two concepts, we get a complete statement: the more often a subject heading is used, the stronger NYPL's collection is in that subject, and the more often two subjects occur together, the stronger the relationship between them. When thinking about the catalog in these terms, it takes on the form of a large network, a galaxy of interconnected subject headings. This network metaphor fits well when it comes to thinking about the library and all the interconnected knowledge it holds.
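As a rough illustration of those two counts, here is a minimal Python sketch that tallies how often each heading is used (node size) and how often each pair of headings appears on the same item (edge weight). The sample records and the `build_network` function are made up for illustration; they are not NYPL's data or code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample: each entry is the list of subject headings on one item.
records = [
    ["Lincoln, Abraham", "United States", "Presidents"],
    ["Lincoln, Abraham", "United States"],
    ["United States", "Presidents"],
]

def build_network(records):
    """Return (node_sizes, edge_weights) from per-item subject heading lists."""
    node_sizes = Counter()    # heading -> number of items using it
    edge_weights = Counter()  # (heading, heading) -> number of shared items
    for headings in records:
        unique = sorted(set(headings))  # ignore duplicates within one item
        node_sizes.update(unique)
        # Every pair of headings on the same item co-occurs once.
        edge_weights.update(combinations(unique, 2))
    return node_sizes, edge_weights

node_sizes, edge_weights = build_network(records)
print(node_sizes.most_common())
print(edge_weights.most_common())
```

Sorting the headings within each item keeps every pair in a consistent order, so a relationship is counted under one key no matter how the headings were listed.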

To transform this thought experiment into reality, we used tools from the field of network analysis to process our catalog. Simply put, the form of network analysis we used could be thought of as a physics simulation. You put in a bunch of connected subject headings of various sizes (number of uses), set the gravity and repulsion strengths, and press play. The result is a self-organization of the subject headings based on how large they are and their relationships to other subjects. For example, if Abraham Lincoln is in one part of the network and United States is in another, and they have a very strong relationship (meaning they are used together many times), then they are going to be pulled toward each other in the network. Using networks in this way is a fantastic way to start seeing patterns and groupings of related subjects.
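To make the physics analogy concrete, here is a toy force-directed layout in plain Python. It is only a sketch of the general idea, not necessarily the algorithm or tool we ran: the `force_layout` function, its parameter values, and the sample edges are all assumptions chosen for illustration, and for brevity it ignores node size.

```python
import math
import random

def force_layout(nodes, edges, steps=500, repulsion=0.01, attraction=0.01,
                 gravity=0.001, max_step=0.1, seed=42):
    """Toy force-directed layout; edges maps (a, b) -> co-occurrence weight."""
    nodes = list(nodes)
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion: every pair of nodes pushes apart, more strongly when close.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d2 = dx * dx + dy * dy + 1e-9  # avoid division by zero
                force[a][0] += repulsion * dx / d2
                force[a][1] += repulsion * dy / d2
                force[b][0] -= repulsion * dx / d2
                force[b][1] -= repulsion * dy / d2
        # Attraction: linked nodes pull together in proportion to edge weight.
        for (a, b), w in edges.items():
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += attraction * w * dx
            force[a][1] += attraction * w * dy
            force[b][0] -= attraction * w * dx
            force[b][1] -= attraction * w * dy
        for n in nodes:
            # Gravity: a weak pull toward the center keeps the layout compact.
            fx = force[n][0] - gravity * pos[n][0]
            fy = force[n][1] - gravity * pos[n][1]
            mag = math.hypot(fx, fy)
            if mag > max_step:  # cap each move so the simulation stays stable
                fx, fy = fx * max_step / mag, fy * max_step / mag
            pos[n][0] += fx
            pos[n][1] += fy
    return pos

def distance(pos, a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

# Hypothetical weights: a strong link and a weak one.
edges = {("Lincoln", "United States"): 10, ("United States", "Presidents"): 1}
pos = force_layout(["Lincoln", "United States", "Presidents"], edges)
print(distance(pos, "Lincoln", "United States"))
print(distance(pos, "United States", "Presidents"))
```

In this sketch the heavily linked pair should settle closer together than the weakly linked pair. Comparing every pair of nodes at every step is fine for three headings but would be far too slow for hundreds of thousands, where dedicated network layout tools rely on approximations.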

We ran this simulation for over 430,000 subject headings that were interconnected by 11 million relationships. Below is a time-lapse video of the simulation playing out over the course of five days:

NYPL Subject Heading Network from NYPL Labs on Vimeo.

This is an extremely zoomed-out view of the process. Each tiny speck represents a subject heading trying to find a place in the larger network based on its relationships. After the dust settled, we took that information and drew an image based on where every subject heading ended up. We also ran some algorithms to group the subject headings together into communities based on their shared connections, and assigned each group a color. The result was an impressive view of our catalog:

Click to view interactive version.
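The community grouping step mentioned above can be illustrated with label propagation, one simple community detection technique. This is a sketch under assumptions, not necessarily the algorithm we ran: the `label_propagation` function, the tie-breaking rule, and the sample headings and weights are all invented for the example.

```python
from collections import defaultdict

def label_propagation(edges, max_rounds=100):
    """Group nodes into communities; edges maps (a, b) -> weight."""
    neighbors = defaultdict(dict)
    for (a, b), w in edges.items():
        neighbors[a][b] = w
        neighbors[b][a] = w
    labels = {n: n for n in neighbors}  # every node starts in its own group
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbors):  # fixed order keeps runs repeatable
            # Add up edge weight per label among this node's neighbors.
            totals = defaultdict(float)
            for other, w in neighbors[node].items():
                totals[labels[other]] += w
            best = max(totals.values())
            candidates = [lab for lab, t in totals.items() if t == best]
            # Keep the current label on a tie; otherwise adopt the strongest.
            if labels[node] not in candidates:
                labels[node] = min(candidates)
                changed = True
        if not changed:
            break
    return labels

# Hypothetical network: two tight clusters joined by one weak link.
edges = {
    ("Lincoln", "Presidents"): 5,
    ("Lincoln", "United States"): 5,
    ("Presidents", "United States"): 5,
    ("Baking", "Cooking"): 5,
    ("Baking", "Recipes"): 5,
    ("Cooking", "Recipes"): 5,
    ("United States", "Cooking"): 1,
}
labels = label_propagation(edges)
for heading, group in sorted(labels.items()):
    print(heading, "->", group)
```

Each distinct label that remains at the end can then be mapped to a color when drawing the network.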

We also created an interactive view of the network, allowing you to zoom in and explore. Click on a subject to see some possible resources, or use the search tool to get in the vicinity of a requested subject. This visualization could be considered a map of NYPL's knowledge, showing the major subjects we have holdings for:

Mapped sections of the network.

As this is our first pass at thinking of the catalog as a network, you may notice a few kinks, with some subjects overly bunched together or overlapping. But it is an exciting first step. We will continue to refine this network approach to the catalog, but we are always thinking ahead. What other new forms could the catalog take to help explore the vast materials living at NYPL? Stay tuned.