U.S. Copyright History 1923–1964
A little over a year ago, Greg Cram wrote about a pilot project we at NYPL were just beginning that aims to unlock the record of American creativity. At that point, he and our then-colleague Josh Hadro (now managing director of the IIIF Consortium) got the ball rolling, wrote a fair amount of initial documentation, and selected a vendor, DCL, to convert the first batch of volumes of the Catalog of Copyright Entries (CCE) from scanned images to parsed XML.
At the beginning of May, we completed the full run of the "Book" volumes of the CCE dating from 1923 to 1964. This gives us the best view, to date, of the number of books registered for copyright during this period in the U.S., as well as how many of these had their copyright renewed and extended.
See Chart Data below for raw numbers
The rough totals: 642,000 registered copyrights with 162,000, or 25%, renewed. Those renewed books are still in copyright today, and the copyrights on the other 480,000 books have expired and are (probably) in the Public Domain. These are initial results, so some important details and caveats follow.
Why Is This Important? Why 1923–1964?
In libraries, we like to have digital versions of books. We buy thousands of ebooks from publishers (which you can get for free with our SimplyE app); but we also have millions of older books in our research divisions that we would love to digitize and make more easily available to people doing research around the world.
During the term of copyright protection, the rights holder has the exclusive right to make and distribute copies of their book. When we want to make a digital copy at NYPL, we need either to have the rights holder's permission, or to rely on an exception or limitation in copyright law. You can go to Hathi Trust, or its non-library equivalent, Google Books, and search more books than you could ever read because indexing the content of an in-copyright book for search is considered fair use; however, presenting the full text may be an infringement.
For many years, the rule of thumb has been that any book published after 1923 was in-copyright (in the U.S.). It takes a bit of convoluted history to explain this date (but Sonny Bono appears at the end):
- The Copyright Act of 1790 set the copyright term as 14 years, renewable for another 14.
- The Copyright Act of 1831 made the term 28 years, renewable for 14.
- The Copyright Act of 1909 established a copyright term of 28 years that could be renewed for another 28.
- The Copyright Act of 1976 did away with the renewal for books in copyright on January 1, 1978 (when the act took effect) and extended the total copyright term for books already renewed to 75 years.
- The Copyright Renewal Act of 1992 pushed this date back and did away with renewal for books published after January 1, 1964.
- The Sonny Bono Copyright Term Extension Act (yes, that Sonny Bono) added another 20 years to the copyright term.
So, a book published in 1922 and renewed after its first 28-year term had a copyright that lasted until 1978 (1922 + 28 + 28). If a book was published in 1923 and renewed for a second term, it would have been just at the end of its copyright in 1978 when its term was extended to 75 years. It would have been about to enter the public domain again in 1998, when the term was extended to 95 years, through 2018. This is why January 1, 2019 was the first day in more than 20 years that any books entered the public domain.
But what if the book wasn't renewed? After its first copyright term, a book published in 1923 became public domain in 1951. A book published in 1963 was subject to the same copyright law. If it wasn't renewed in 1990, it became public domain at the start of 1991.
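The date arithmetic above can be sketched as a small Python function. This is only an illustration of the simplified rules described in this section (a 28-year first term, and protection through the 95th year after publication for renewed books). The function name is made up for the example, and it deliberately ignores complications such as foreign publication.

```python
FIRST_TERM = 28      # years in the first copyright term (1909 Act)
EXTENDED_TERM = 95   # total years for renewed books after the 1976 and 1998 extensions


def public_domain_year(pub_year: int, renewed: bool) -> int:
    """Year a Renewal Era book entered (or will enter) the public domain.

    Simplified sketch: covers only books published 1923-1963.
    """
    if not 1923 <= pub_year <= 1963:
        raise ValueError("this sketch only covers books published 1923-1963")
    if renewed:
        # Protected through the end of the 95th year, so free on January 1 of the next.
        return pub_year + EXTENDED_TERM + 1
    # Unrenewed: the copyright lapsed when the first 28-year term ran out.
    return pub_year + FIRST_TERM


print(public_domain_year(1923, renewed=True))   # 2019
print(public_domain_year(1923, renewed=False))  # 1951
```

A renewed book from 1963 would, by the same rule, stay in copyright through 2058.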
For a long time, any book published before 1923 has surely been in the Public Domain and any book published after 1963 has positively been in copyright. Between those two dates, though, there is a more complex zone I'll call the Renewal Era. Of course, the lack of a renewal is not quite enough to say that something is no longer in copyright. As John Ockerbloom pointed out when I initially tweeted about these results, unrenewed books might "include previously published material still under copyright, or [have been] published abroad 1st & meet certain other URAA conditions."
Registrations and Renewals
Assuming, for simplicity's sake, that none of those considerations are relevant to a particular book, if it was published during the Renewal Era and not renewed, then it is in the public domain. To figure out if a book has been renewed, you turn to the Catalog of Copyright Entries, many dozens of thick volumes published every year until 1977 (after which the copyright records became electronic).
The various volumes of the CCE contain registrations and renewals of every kind of copyrightable work, including books, music, movies, artworks, and labels on commercial products. They are, in Greg Cram's words, "one of the best records of American creativity." We're interested in all these things at the Library, but because books are relatively easy to digitize and use in digital form, we would like to know which ones are still in copyright and which aren't.
Since 2007, renewals have been in a searchable database at Stanford making it fairly simple to find books that have been renewed. Proving the negative, that something wasn't renewed, hasn't been as easy—typos, slight changes in titles, and other complications might cause a renewal search to fail. It has also been difficult to say what percentage of books have been renewed. Estimates, based on samples, have ranged from 7% to 33%.
While there is still plenty of work to do to clean up this data and understand some nuances of the entries, for the first time we have both ends of the copyright lifetime in a digital, ultimately searchable form for a full category of works, over a complete and continuous period of time. With the registrations now in digital form, not only do we have more information about the renewed books, we can also identify all those that do not have corresponding renewals.
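As a rough illustration of what having both ends in digital form makes possible, here is a minimal Python sketch that splits registrations into renewed and unrenewed. It assumes each registration is a dictionary with a `regnum` key and that the registration number alone identifies an entry. Both are simplifications for the example, not the actual layout of the published files, and the sample numbers are invented.

```python
def normalize(reg_number: str) -> str:
    """Drop whitespace and uppercase, so 'a 100001' and 'A100001' compare equal."""
    return "".join(reg_number.split()).upper()


def split_by_renewal(registrations, renewed_numbers):
    """Return (renewed, unrenewed) lists of registration records."""
    renewed_set = {normalize(n) for n in renewed_numbers}
    renewed, unrenewed = [], []
    for reg in registrations:
        if normalize(reg["regnum"]) in renewed_set:
            renewed.append(reg)
        else:
            unrenewed.append(reg)
    return renewed, unrenewed


# Invented sample data, for illustration only.
registrations = [
    {"regnum": "A100001", "title": "First example"},
    {"regnum": "A100002", "title": "Second example"},
    {"regnum": "a 100003", "title": "Third example"},
]
renewed, unrenewed = split_by_renewal(registrations, ["A100003"])
print([r["title"] for r in unrenewed])  # ['First example', 'Second example']
```

Exact matching like this is only as good as the ID numbers on both sides, which is why the accuracy of those numbers matters so much.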
What's in the Data
We are publishing the data in two repositories:
- Registrations: https://github.com/NYPL/catalog_of_copyright_entries_project
- Renewals: https://github.com/NYPL/cce-renewals
The bulk of the effort has been to convert book registrations from 1923 to 1964 into XML format. This includes Part 1, Group 1 (1923–1946), Part 1A (1947–1953), and Part 1 (1953–1964) of the CCE. In addition, we have created a new version of the renewals in tab-delimited format (the same information found in the Stanford database, but parsed differently to work more accurately with the registrations).
The renewal data contains both halves of Part 1 (Groups 1 and 2, Parts 1A and 1B) as well as their combined versions for 1950-1977, parsed from a transcription made by Project Gutenberg. For the years 1978 on, there are registrations for all classes taken from a version of the renewals exported from the Copyright Office database and hosted by Google.
Beginning with July 1953, the "Book" volume is Part 1, "Books and Pamphlets, Including Serials and Contributions to Periodicals." Prior to this, pamphlets, serials, and contributions to serials (and sermons, lectures, and many other things) were published separately as Group 2 or Part 1B, which are not included in this data yet. For the first half of 1953, there are about 8,200 entries from 3rd series, volume 7, part 1A, number 1; for the second half of the same year, there are more than 20,000 entries because 3rd series, volume 7, part 1, number 2 included everything that previously would have been published separately in part 1B.
Books and Not-Books
Every registration is assigned to a class as indicated by the letter prefix of its registration number: "A" for books, "B" for serials, "D" for dramas, etc. This nominally corresponds to the division into volumes so we would expect all the "D"s to be in the "Dramatic Compositions" volume (Part 1, Group 3, later Part 3). In practice this is not the case—Eugene O'Neill's A Moon for the Misbegotten, for instance, is included in Part 1, Group 1 (1952; DP1117) along with a few hundred other class "D" registrations.
We might wonder why DP1117 wasn't published in Group 3 with the other "D"s or why, if it's more like a book somehow, it wasn't given an "A" number. It raises the converse question, though: are there any class "A" entries in Group 1 or Part 1A that someone might class as plays? I was able to find 100 entries that have "… a play in …" in the title, from Hilda; a play in four acts by Frances Guignard Gibbes (1923; A696442) to Seven devils from Magdala; a play in three acts.
Because of examples like this, I think it's fairly fruitless to try to determine what is a book or a "book proper" from the information in the CCE, so we have simply counted the contents of the volumes we have digitized. "A Moon for the Misbegotten" was renewed, as were about 15% of the class "D" entries in Group 1/Part 1A. The situation is worse with class "A" entries, where the not-very-well-held distinction between books (class "A") and non-books (classes such as "AA" and "A5") is partly erased after 1953. "AA" is done away with and presumably collapsed into "A". "A5" continued first as "B5" and then as "BB".
That said, inclusion in Group 1/Part 1A turns out to be a pretty good predictor of the kinds of things that tend to be renewed. If we look again at 1953, Part 1 Number 2 (the second half of the year, with the two groups combined) has 153% more entries than Part 1A Number 1 (20,811 vs. 8,217), but only about 31% more renewals (2,820 vs. 2,154). This implies something like a 5% renewal rate for Group 2/Part 1B entries. Many of those few renewed items may, in fact, be books. We recently learned that children's books, for instance, were routinely lumped in with "pamphlets."
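The reasoning behind that 5% figure can be written out as a short calculation. The sketch below uses only the four numbers quoted above. It rests on the assumption, ours for the sake of the estimate, that the Part 1A-type material in the second half of 1953 looked roughly like the first half, so the extra entries and extra renewals can be attributed to Part 1B-type material.

```python
# Figures quoted above for 1953 (3rd series, volume 7).
part_1a_entries, part_1a_renewed = 8_217, 2_154      # number 1: Part 1A only
combined_entries, combined_renewed = 20_811, 2_820   # number 2: 1A and 1B material together

# Assumption: whatever exceeds the first-half counts is 1B-type material.
extra_entries = combined_entries - part_1a_entries   # 12,594
extra_renewed = combined_renewed - part_1a_renewed   # 666

implied_1b_rate = extra_renewed / extra_entries
part_1a_rate = part_1a_renewed / part_1a_entries

print(f"Part 1A renewal rate:         {part_1a_rate:.1%}")     # 26.2%
print(f"Implied Part 1B renewal rate: {implied_1b_rate:.1%}")  # 5.3%
```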
Because of this change in the way the CCE was arranged, the count of renewals presented for 1953-63 must include some things that aren't "books". We also imagine some things that are "books" aren't counted for the years before 1953 because they are in Group 2/Part 1B, which we haven't converted yet. Also, because the count of unrenewed entries ("books" and "non-books") would be so much higher for 1953-63, I chose to estimate what would have been in Part 1A if the 1A/1B distinction had continued. Non-renewed entries are estimated at 2.7 times the number of renewals (0.73 / 0.27). This is based on two generalizations: everything renewed is a book (close to true), and the 27% average renewal rate for 1946-1952 held for 1953-1963.
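For anyone who wants to reproduce the estimate: a 27% renewal rate implies about 2.7 unrenewed entries for every renewal (0.73 / 0.27). The Python sketch below applies that ratio to a renewal count. The function name is ours, and rounding to the nearest whole entry is an assumption, though it does reproduce the estimated figures in the Chart Data.

```python
ASSUMED_RENEWAL_RATE = 0.27  # average renewal rate for 1946-1952


def estimate_not_renewed(renewed: int, rate: float = ASSUMED_RENEWAL_RATE) -> int:
    """Estimate unrenewed entries from a renewal count and an assumed renewal rate."""
    # If `rate` of all entries were renewed, unrenewed = renewed * (1 - rate) / rate.
    return round(renewed * (1 - rate) / rate)


print(estimate_not_renewed(5160))  # 13951, the 1953 estimate
print(estimate_not_renewed(8740))  # 23630, the 1963 estimate
```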
The only class of entries excluded from the count is interim registrations (class "AI"), since they would be an obvious source of undercounting or double counting, depending on how renewals are matched to registrations. Ultimately, what we really want to be able to do is count copyrights rather than entries by grouping interim (AI) and foreign (AF) registrations together with corresponding A entries as a single entity. A handful of entries in each volume are very complicated to parse and have also been ignored for now. These tend to be things like dozens of issues of Bell System technical bulletins and aren't particularly interesting for this analysis.
Further Work
Two obvious tasks lie before us: correcting the data and adding more data. Beyond that, I'm sure many people would like to see an online interface for exploring the entries. Linking the data both internally (entry to entry) and to external identifiers would make it really useful in the library world.
Correcting the Data
The XML files for the completed volumes of the CCE amount to 687 MB of data, all of which has been scanned, OCRed, keyed, and tagged, so we expect a certain number of errors at each step. We are focusing mostly on the accuracy of ID numbers so that registrations and renewals can be correctly paired; fortunately, there are things we can do to chase down many mistakes. For instance, within the new series or third series, registration numbers should be unique and duplicates can be investigated (the light printing of some pages makes 0's, 3's, 6's and 8's especially difficult for OCR to distinguish). Frequently, the errors are typos in the CCE entries themselves.
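The duplicate check mentioned above is simple to sketch. The Python snippet below assumes the registration numbers for one series have already been extracted into a list of strings. How they are pulled out of the XML is left aside, and the sample numbers are invented for illustration.

```python
from collections import Counter


def find_duplicate_numbers(reg_numbers):
    """Return the registration numbers that occur more than once, sorted."""
    counts = Counter(reg_numbers)
    return sorted(number for number, n in counts.items() if n > 1)


# Invented sample: one number appears twice, perhaps because OCR
# read an 8 as a 3 on a lightly printed page.
sample = ["A100001", "A100003", "A100003", "A100006"]
print(find_duplicate_numbers(sample))  # ['A100003']
```

Each number this flags still needs a human to look at the page scan and decide which entry, if either, is correct.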
Anyone who works with bibliographic data knows how difficult the many variations of authors' and publishers' names can be to deal with. Though the tagging of these fields is currently accurate enough to be very useful, this is probably the area most in need of correction. Even better would be to link authors and publishers to VIAF (Virtual International Authority File) and other identifiers.
We welcome correction from any source. If you think you have spotted an error, you can add an issue in the repository for registrations or renewals.
More Data
It is clear from the discussion above that, even if your interest is only books, the pre-1953 "pamphlet" volumes (Part 1 Group 2 and Part 1B) are still important. Beyond the books, the CCE covers every kind of creative endeavor and these volumes have a great deal of value as an historical record. Having a complete historical record, however, would mean converting not only the volumes for the years in which copyright is in question, but also the pre-1923 and post-1964 volumes. We are, at the moment, planning to do later volumes of Part 1, and would be happy to collaborate with anyone who wanted to take on any part of the CCE.
Linking Data
There are internal and external links that can be made. Links between registrations and renewals are explicit, but links between a registration and a previous interim registration, or to an original entry when new matter is being registered, are not always present.
Probably the most useful links would be between the registrations and equivalent records in other sources. Through 1937, the entries contain Library of Congress Control Numbers (LCCNs), which are a key to linking them to OCLC (Online Computer Library Center) records and Hathi Trust. It would be wonderful to have a way to make connections between these sources and entries from other years. Having an LCCN or OCLC number corresponding to a registration would make it easier to correctly link VIAF ids for authors and publishers, in order to make those searches more accurate.
Chart Data
Books are counted under the year of their registration rather than publication in the CCE. That is, a book with a 1950 registration date may be published in the 1950 volume of the CCE, but there is a good chance it appears in the 1951 volume, a smaller chance in the 1952 volume, and so on. Therefore, these numbers will not match the entry counts given in each printed volume since those are counts by publication rather than registration year.
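| Year | # Renewed | # Not Renewed | # Not Renewed (estimated) | Total | Percentage Renewed |
| --- | --- | --- | --- | --- | --- |
| 1923 | 1593 | 7198 | | 8791 | 18.12% |
| 1924 | 1633 | 7819 | | 9452 | 17.28% |
| 1925 | 1796 | 8869 | | 10665 | 16.84% |
| 1926 | 1955 | 9436 | | 11391 | 17.16% |
| 1927 | 2185 | 10413 | | 12598 | 17.34% |
| 1928 | 2384 | 11822 | | 14206 | 16.78% |
| 1929 | 2697 | 11161 | | 13858 | 19.46% |
| 1930 | 2559 | 11844 | | 14403 | 17.77% |
| 1931 | 2726 | 10761 | | 13487 | 20.21% |
| 1932 | 2677 | 9880 | | 12557 | 21.32% |
| 1933 | 2495 | 8925 | | 11420 | 21.85% |
| 1934 | 2666 | 9454 | | 12120 | 22.00% |
| 1935 | 2875 | 9691 | | 12566 | 22.88% |
| 1936 | 2989 | 9939 | | 12928 | 23.12% |
| 1937 | 3201 | 9674 | | 12875 | 24.86% |
| 1938 | 3242 | 10020 | | 13262 | 24.45% |
| 1939 | 3109 | 8990 | | 12099 | 25.70% |
| 1940 | 3374 | 9068 | | 12442 | 27.12% |
| 1941 | 3451 | 7353 | | 10804 | 31.94% |
| 1942 | 3229 | 5896 | | 9125 | 35.39% |
| 1943 | 2814 | 5198 | | 8012 | 35.12% |
| 1944 | 2585 | 4868 | | 7453 | 34.68% |
| 1945 | 2444 | 5971 | | 8415 | 29.04% |
| 1946 | 2954 | 8751 | | 11705 | 25.24% |
| 1947 | 3583 | 9788 | | 13371 | 26.80% |
| 1948 | 3544 | 8901 | | 12445 | 28.48% |
| 1949 | 3568 | 9930 | | 13498 | 26.43% |
| 1950 | 4257 | 11122 | | 15379 | 27.68% |
| 1951 | 4255 | 11167 | | 15422 | 27.59% |
| 1952 | 4138 | 11920 | | 16058 | 25.77% |
| 1953 | 5160 | | 13951 | 19111 | 27.00% |
| 1954 | 5915 | | 15992 | 21907 | 27.00% |
| 1955 | 5984 | | 16179 | 22163 | 27.00% |
| 1956 | 5925 | | 16019 | 21944 | 27.00% |
| 1957 | 6731 | | 18199 | 24930 | 27.00% |
| 1958 | 6787 | | 18350 | 25137 | 27.00% |
| 1959 | 7256 | | 19618 | 26874 | 27.00% |
| 1960 | 7420 | | 20061 | 27481 | 27.00% |
| 1961 | 7503 | | 20286 | 27789 | 27.00% |
| 1962 | 8017 | | 21676 | 29693 | 27.00% |
| 1963 | 8740 | | 23630 | 32370 | 27.00% |
| Total | 162416 | | | 642206 | 25.29% |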