A little over a year ago, Greg Cram wrote about a pilot project we at NYPL were just beginning that aims to unlock the record of American creativity. At that point, he and our then-colleague Josh Hadro (now managing director of the IIIF Consortium) got the ball rolling, wrote a fair amount of initial documentation, and selected a vendor, DCL, to convert the first batch of volumes of the Catalog of Copyright Entries (CCE) from scanned images to parsed XML.
At the beginning of May, we completed the full run of the "Book" volumes of the CCE dating from 1923 to 1964. This gives us the best view, to date, of the number of books registered for copyright during this period in the U.S., as well as how many of these had their copyright renewed and extended.
See Chart Data below for raw numbers
The rough totals: 642,000 registered copyrights with 162,000, or 25%, renewed. Those renewed books are still in copyright today, and the copyrights on the other 480,000 books have expired and are (probably) in the Public Domain. These are initial results, so some important details and caveats follow.
Why Is This Important? Why 1923–1964?
In libraries, we like to have digital versions of books. We buy thousands of ebooks from publishers (which you can get for free with our SimplyE app); but we also have millions of older books in our research divisions that we would love to digitize and make more easily available to people doing research around the world.
During the term of copyright protection, the rights holder has the exclusive right to make and distribute copies of their book. When we want to make a digital copy at NYPL, we need either to have the rights holder's permission, or to rely on an exception or limitation in copyright law. You can go to Hathi Trust, or its non-library equivalent, Google Books, and search more books than you could ever read because indexing the content of an in-copyright book for search is considered fair use; however, presenting the full text may be an infringement.
For many years, the rule of thumb has been that any book published after 1923 was in-copyright (in the U.S.). It takes a bit of convoluted history to explain this date (but Sonny Bono appears at the end):
- The Copyright Act of 1790 set the copyright term as 14 years, renewable for another 14.
- The Copyright Act of 1831 made the term 28 years, renewable for 14.
- The Copyright Act of 1909 established a copyright term of 28 years that could be renewed for another 28.
- The Copyright Act of 1976 did away with the renewal for books in copyright on January 1, 1978 (when the act took effect) and extended the total copyright term for books already renewed to 75 years.
- The Copyright Renewal Act of 1992 pushed this date back and did away with renewal for books published after January 1, 1964.
- The Sonny Bono Copyright Term Extension Act (yes, that Sonny Bono) added another 20 years to the copyright term.
So, a book published in 1922 and renewed after its first 28-year term had a copyright that lasted until 1978 (1922 + 28 + 28). If a book was published in 1923 and renewed for a second term, it would have been just at the end of its copyright in 1978 when its term was extended to 75 years. It would have been about to enter public domain again in 1998, when this was extended to 95 years, through 2018. This is why January 1, 2019 was the first day that any books had entered the public domain in more than 20 years.
But what if the book wasn't renewed? After its first copyright term, a book published in 1923 became public domain in 1951. A book published in 1963 was subject to the same copyright law. If it wasn't renewed in 1990, it became public domain at the start of 1991.
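The arithmetic in the last two paragraphs can be sketched in a few lines of Python. This is only an illustration of the simplified rules described here (a 28-year first term, and 95 years of protection for renewed books), not legal advice; the function name `public_domain_year` is made up for this example, and the caveats discussed elsewhere in this post (foreign publication, previously published material) are ignored.

```python
# Publication years 1923 through 1963, the "Renewal Era" discussed in this post.
RENEWAL_ERA = range(1923, 1964)


def public_domain_year(pub_year: int, renewed: bool) -> int:
    """Return the year a Renewal Era book became (or becomes) public domain.

    Simplified rules, as described in the text:
      - not renewed: the first 28-year term lapses, so pub_year + 28
      - renewed: protected for 95 years, through the end of pub_year + 95,
        so it enters the public domain on January 1 of pub_year + 96
    """
    if pub_year not in RENEWAL_ERA:
        raise ValueError("this sketch only covers books published 1923-1963")
    if not renewed:
        return pub_year + 28
    return pub_year + 95 + 1


print(public_domain_year(1923, renewed=True))   # 2019
print(public_domain_year(1923, renewed=False))  # 1951
print(public_domain_year(1963, renewed=False))  # 1991
```

The three printed values match the examples in the text: the renewed 1923 book protected through 2018, and the unrenewed 1923 and 1963 books.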
For a long time, any book published before 1923 has surely been in the Public Domain and any book published after 1963 has positively been in copyright. Between those two dates though there is a more complex zone I'll call the Renewal Era. Of course, the lack of a renewal is not quite enough to say that something is no longer in copyright. As John Ockerbloom pointed out when I initially tweeted about these results, unrenewed books might "include previously published material still under copyright, or [have been] published abroad 1st & meet certain other URAA conditions."
Registrations and Renewals
Assuming, for simplicity's sake, that none of those considerations are relevant to a particular book, if it was published during the Renewal Era and not renewed, then it is in the public domain. To figure out if a book has been renewed, you turn to the Catalog of Copyright Entries, many dozens of thick volumes published every year until 1977 (after which the copyright records became electronic).
The various volumes of the CCE contain registrations and renewals of every kind of copyrightable work, including books, music, movies, artworks, and labels on commercial products. They are, in Greg Cram's words, "one of the best records of American creativity." We're interested in all these things at the Library, but because books are relatively easy to digitize and use in digital form, we would like to know which ones are still in copyright and which aren't.
Since 2007, renewals have been in a searchable database at Stanford making it fairly simple to find books that have been renewed. Proving the negative, that something wasn't renewed, hasn't been as easy—typos, slight changes in titles, and other complications might cause a renewal search to fail. It has also been difficult to say what percentage of books have been renewed. Estimates, based on samples, have ranged from 7% to 33%.
While there is still plenty of work to do to clean up this data and understand some nuances of the entries, for the first time we have both ends of the copyright lifetime in a digital, ultimately searchable form for a full category of works, over a complete and continuous period of time. With the registrations now in digital form, not only do we have more information about the renewed books, we can also identify all those that do not have corresponding renewals.
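Once both registrations and renewals are in digital form, identifying the unrenewed entries amounts to a set difference on registration numbers. The sketch below is a minimal illustration of that idea, not the project's actual code: the function names, the dictionary layout, and the sample registration numbers and titles are all hypothetical, and real data would need far more careful cleanup than the simple normalization shown.

```python
from typing import Dict, Iterable


def normalize(reg_number: str) -> str:
    """Remove whitespace and upper-case a registration number before comparing."""
    return "".join(reg_number.split()).upper()


def find_unrenewed(
    registrations: Dict[str, str], renewed_numbers: Iterable[str]
) -> Dict[str, str]:
    """Return the registrations whose numbers never appear among the renewals.

    registrations maps registration number -> title (a hypothetical layout);
    renewed_numbers lists the original registration numbers cited by renewals.
    """
    renewed = {normalize(n) for n in renewed_numbers}
    return {
        number: title
        for number, title in registrations.items()
        if normalize(number) not in renewed
    }


# Made-up sample data, for illustration only.
sample_registrations = {
    "A100001": "First hypothetical title",
    "A100002": "Second hypothetical title",
    "A100003": "Third hypothetical title",
}
sample_renewed = ["a 100002"]  # matches A100002 after normalization

print(find_unrenewed(sample_registrations, sample_renewed))
```

A missing match is only as trustworthy as the ID numbers on both sides, which is why errors in those numbers matter so much for this kind of pairing.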
What's in the Data
We are publishing the data in two repositories:
- Registrations: https://github.com/NYPL/catalog_of_copyright_entries_project
- Renewals: https://github.com/NYPL/cce-renewals
The bulk of the effort has been to convert book registrations from 1923 to 1964 into XML format. This includes Part 1, Group 1 (1923–1946), Part 1A (1947–1953), and Part 1 (1953–1964) of the CCE. In addition, we have created a new version of the renewals in tab-delimited format (the same information found in the Stanford database, but parsed differently to work more accurately with the registrations).
The renewal data contains both halves of Part 1 (Groups 1 and 2, Parts 1A and 1B) as well as their combined versions for 1950-1977, parsed from a transcription made by Project Gutenberg. For the years 1978 on, there are renewals for all classes, taken from a version of the renewal records exported from the Copyright Office database and hosted by Google.
Beginning with July 1953, the "Book" volume is Part 1, "Books and Pamphlets, Including Serials and Contributions to Periodicals." Prior to this, pamphlets, serials, and contributions to serials (and sermons, lectures, and many other things) were published separately as Group 2 or Part 1B, which are not included in this data yet. For the first half of 1953, there are about 8,200 entries from 3rd series, volume 7, part 1A, number 1; for the second half of the same year, there are more than 20,000 entries because 3rd series, volume 7, part 1, number 2 included everything that previously would have been published separately in part 1B.
Books and Not-Books
Every registration is assigned to a class as indicated by the letter prefix of its registration number: "A" for books, "B" for serials, "D" for dramas, etc. This nominally corresponds to the division into volumes so we would expect all the "D"s to be in the "Dramatic Compositions" volume (Part 1, Group 3, later Part 3). In practice this is not the case—Eugene O'Neill's A Moon for the Misbegotten, for instance, is included in Part 1, Group 1 (1952; DP1117) along with a few hundred other class "D" registrations.
We might wonder why DP1117 wasn't published in Group 3 with the other "D"s or why, if it's more like a book somehow, it wasn't given an "A" number. It raises the question, though: are there any class "A" entries in Group 1 or Part 1A that someone might class as plays? I was able to find 100 entries that have "… a play in …" in the title, from Hilda; a play in four acts by Frances Guignard Gibbes (1923; A696442) to Seven devils from Magdala; a play in three acts.
Because of examples like this, I think it's fairly fruitless to try to determine what is a book or a "book proper" from the information in the CCE, so we have simply counted the contents of the volumes we have digitized. "A Moon for the Misbegotten" was renewed as were about 15% of the class 'D' entries in Group 1/Part 1A. The situation is worse with class "A" entries, where the not-very-well-held distinction between books (class "A") and non-books (classes such as "AA" and "A5") is partly erased after 1953. "AA" is done away with and presumably collapsed into "A". "A5" continued first as "B5" and then as "BB".
That said, inclusion in Group 1/Part 1A turns out to be a pretty good predictor of the kinds of things that tend to be renewed. If we look again at 1953, Part 1 Number 2, the second half of year with the two groups combined has 153% more entries than Part 1A Number 1 (20,811 vs. 8,217), but only 30% more of those are renewed (2,820 vs. 2,154). This implies something like a 5% renewal rate for Group 2/Part 1B entries. Many of those few renewed items may, in fact, be books. We recently learned that children's books, for instance, were routinely lumped in with "pamphlets."
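The 5% figure can be reproduced from the four counts quoted above. The sketch below assumes, as the paragraph implicitly does, that the book-like portion of the second half of 1953 is about the same size, with about the same number of renewals, as Part 1A Number 1; everything beyond that is attributed to what would previously have been Part 1B. The variable names are made up for this illustration.

```python
# Counts quoted in the text for 1953.
part_1a_entries = 8_217    # Part 1A Number 1 (first half of the year)
combined_entries = 20_811  # Part 1 Number 2 (second half, 1A and 1B combined)
part_1a_renewed = 2_154
combined_renewed = 2_820

# Assumption: the surplus in the combined volume is the former Part 1B material.
extra_entries = combined_entries - part_1a_entries  # 12,594
extra_renewed = combined_renewed - part_1a_renewed  # 666

entry_increase = extra_entries / part_1a_entries
renewal_increase = extra_renewed / part_1a_renewed
implied_1b_rate = extra_renewed / extra_entries

print(f"{entry_increase:.0%} more entries")         # 153% more entries
print(f"{renewal_increase:.0%} more renewals")      # 31% more renewals
print(f"{implied_1b_rate:.1%} implied Part 1B renewal rate")  # 5.3%
```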
Because of this change in the way the CCE was arranged, the count of renewals presented for 1953-63 must include some things that aren't "books". We also imagine some things that are "books" aren't counted for the years before 1953 because they are in Group 2/Part 1B, which we haven't converted yet. Also, because the count of unrenewed entries ("books" and "non-books") would be so much higher for 1953-63, I chose to estimate what would have been in Part 1A if the 1A/1B distinction had continued. Non-renewed entries are estimated at 2.7 times the number of renewals (73 unrenewed entries for every 27 renewed). This is based on two generalizations: everything renewed is a book (close to true) and the 27% average renewal rate for 1946-1952 held for 1953-1964.
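To make the estimate concrete: at a 27% renewal rate, each renewal implies about 3.7 entries in total, of which about 2.7 are unrenewed. The function below is a minimal sketch of that calculation; its name and the sample renewal count are hypothetical, and it simply takes the two generalizations above as given.

```python
def estimate_not_renewed(renewals: int, renewal_rate: float = 0.27) -> int:
    """Estimate unrenewed entries from a renewal count and an assumed rate.

    If `renewals` is `renewal_rate` of all entries, the total is
    renewals / renewal_rate and the remainder is the unrenewed estimate.
    """
    if not 0 < renewal_rate <= 1:
        raise ValueError("renewal_rate must be in the interval (0, 1]")
    estimated_total = renewals / renewal_rate
    return round(estimated_total - renewals)


# Ratio of unrenewed to renewed entries at a 27% renewal rate.
ratio = (1 - 0.27) / 0.27
print(f"{ratio:.1f}")  # 2.7

# A made-up renewal count, for illustration only.
print(estimate_not_renewed(2_700))  # 7300
```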
The only class of entries that has been excluded from the count is interim registrations (class "AI"), since they would be an obvious source of undercounting or double counting, depending on how renewals are matched to registrations. Ultimately, what we really want to be able to do is count copyrights rather than entries by grouping interim (AI) and foreign (AF) registrations together with corresponding A entries as a single entity. A handful of entries in each volume are very complicated to parse and have also been ignored for now. These tend to be things like dozens of issues of Bell System technical bulletins and aren't particularly interesting for this analysis.
Two obvious tasks lie before us: correcting the data and adding more data. Beyond that, I'm sure many people would like to see an online interface for exploring the entries. Linking the data both internally (entry to entry) and to external identifiers would make it really useful in the library world.
Correcting the Data
The XML files for the completed volumes of the CCE amount to 687 MB of data, all of which has been scanned, OCRed, keyed, and tagged, so we expect a certain number of errors to have crept in at each step. We are focusing mostly on the accuracy of ID numbers so that registrations and renewals can be correctly paired; fortunately, there are things we can do to chase down many mistakes. For instance, within the new series or third series, registration numbers should be unique and duplicates can be investigated (the light printing of some pages makes 0's, 3's, 6's and 8's especially difficult for OCR to distinguish). Frequently, the errors are typos in the CCE entries themselves.
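The duplicate check described above might look something like the following sketch. It assumes the registration numbers have already been pulled out of the XML as (number, location) pairs; that layout, the function name, and the sample values are hypothetical and are not the project's actual tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_numbers(
    entries: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group entries by registration number and keep numbers seen more than once.

    Each entry is a (registration_number, location) pair, where location is
    any label that helps a person find the entry again (volume, page, etc.).
    """
    locations = defaultdict(list)
    for number, location in entries:
        locations[number].append(location)
    return {num: locs for num, locs in locations.items() if len(locs) > 1}


# Made-up sample: the second "A123456" might really be A123486 misread by OCR.
sample_entries = [
    ("A123456", "volume X, page 10"),
    ("A123457", "volume X, page 10"),
    ("A123456", "volume X, page 212"),
]

for number, places in find_duplicate_numbers(sample_entries).items():
    print(number, "appears at:", "; ".join(places))
```

Each flagged number still needs a person to check the scanned page, since a duplicate could be an OCR misreading, a keying mistake, or a typo in the printed CCE itself.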
Anyone who works with bibliographic data knows how difficult the many variations of authors' and publishers' names can be to deal with. Though the tagging of these fields is currently accurate enough to be very useful, this is probably the area most in need of correction. Even better would be to link authors and publishers to VIAF (Virtual International Authority File) and other identifiers.
It is clear from the discussion above that, even if your interest is only books, the pre-1953 "pamphlet" volumes (Part 1 Group 2 and Part 1B) are still important. Beyond the books, the CCE covers every kind of creative endeavor and these volumes have a great deal of value as an historical record. Having a complete historical record, however, would mean converting not only the volumes for the years in which copyright is in question, but also the pre-1923 and post-1964 volumes. We are, at the moment, planning to do later volumes of Part 1, and would be happy to collaborate with anyone who wanted to take on any part of the CCE.
There are internal and external links that can be made. Links between registrations and renewals are explicit, but links between a registration and a previous interim registration, or to an original entry when new matter is being registered, are not always present.
Probably the most useful links would be between the registrations and equivalent records in other sources. Through 1937, the entries contain Library of Congress Control Numbers (LCCNs), which are a key to linking them to OCLC (Online Computer Library Center) records and Hathi Trust. It would be wonderful to have a way to make connections between these sources and entries from other years. Having an LCCN or OCLC number corresponding to a registration would make it easier to correctly link VIAF IDs for authors and publishers, in order to make those searches more accurate.
Books are counted under the year of their registration rather than publication in the CCE. That is, a book with a 1950 registration date may be published in the 1950 volume of the CCE, but there is a good chance it appears in the 1951 volume, a smaller chance in the 1952 volume, and so on. Therefore, these numbers will not match the entry counts given in each printed volume since those are counts by publication rather than registration year.
Chart Data column headings: Number Not Renewed; Number Not Renewed (estimated)