Exploring Open Data: Public Domain Works in Wikidata
The Current State of Open Data on Public Domain Works
A public domain work is essentially any creative material that is not protected by copyright law. The internet has made the content of these works wonderfully findable and accessible to the general public. Surprisingly, retrieving the metadata on these works is an entirely different story, perhaps precisely because no intellectual property is at stake. Public domain art archives have made the biggest strides: the Metropolitan Museum of Art and the Art Institute of Chicago have been pioneers in creating rich knowledge structures for their archives and making that data open via robust APIs. Several other art institutions, such as the Rijksmuseum in the Netherlands, Harvard Art Museums, and the Smithsonian, are not far behind. As a result, a user or agent wanting to query and build public art knowledge graphs can expect to do so without too much friction.
Unfortunately, outside of art, this is much less the case. In literature, Project Gutenberg and Open Library, both excellent sources for reading public domain texts, do open up their metadata, but queryability, schemas, and comprehensiveness are lacking. Similarly, public domain music can be readily accessed through various internet archives, and music scores through the International Music Score Library Project, but complete and accurate metadata is difficult to retrieve. Fortunately, MusicBrainz, an open encyclopedia of music data, is facilitating the ongoing curation of and access to much of this information. Public domain film metadata does not appear to be widely available in clean, open, queryable formats, but as many more films enter the public domain in the coming decades, the hope is that this will change.
The US government itself does provide excellent metadata for the public domain works it has reason to archive. The National Archives, the National Gallery of Art, NASA, and the Library of Congress not only provide excellent access to public works but have made the metadata accessible as well. The Library of Congress in particular has an excellent API for open data retrieval. The ongoing data.gov initiative should help more government datasets become queryable, including datasets dealing with public domain works.
Any user or agent wanting open data on public domain works must at the very least be able to ask with confidence: is this work in the public domain? Unfortunately, very few open data systems are able to answer that question directly. The ideal is a queryable “isPublicDomain” data field, such as the one the Metropolitan Museum of Art exposes. Less ideal, but adequate, is for the system to require copyright data for each work, so that a user can infer public domain status when no copyright data exists. Several archives do this. Many internet archives provide the content of the uncopyrighted work but do not expose any metadata. And at worst, some systems provide the metadata, but a user or agent cannot trust the accuracy or completeness of a query for public domain status.
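The tiers described above can be sketched as a small three-way check on a metadata record. This is a minimal illustration rather than any archive's actual logic: the “isPublicDomain” field name comes from the Met example above, while the “copyright” field name and the function itself are hypothetical stand-ins.

```python
from typing import Any, Mapping, Optional


def public_domain_status(record: Mapping[str, Any]) -> Optional[bool]:
    """Return True or False when the record supports an answer, else None."""
    # Best case: an explicit boolean flag such as "isPublicDomain".
    flag = record.get("isPublicDomain")
    if isinstance(flag, bool):
        return flag

    # Fallback: a copyright field that exists but is empty suggests that
    # nobody claims copyright. "copyright" is a hypothetical field name.
    if "copyright" in record:
        return not str(record["copyright"] or "").strip()

    # No usable metadata: the status cannot be determined from this record.
    return None
```

Returning None instead of guessing keeps the “cannot tell” case visible, which is exactly the situation many archives leave their users in.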
Exploring Public Domain Works in Wikidata using OpenDataDEx
The scattered nature of public domain data makes Wikidata a valuable central source for comparing public domain works across types and institutions. At Wikidata’s SPARQL endpoint, a user or agent can retrieve a single result set that can include works as varied as the Great Sphinx in Egypt, Shakespeare’s Macbeth, Van Gogh’s Starry Night, and Beethoven’s Symphony No. 5. And the details of that dataset will be structured knowledge, adhering to Semantic Web principles, so that the user or agent can query on relationships and integrate that data into their own systems. Furthermore, Wikidata has a commitment to Linked Data principles, so many of these public domain entities include ID properties that link to the various institutions and archives that curate these works. This enables retrieving more robust metadata from trusted sources.
Although Wikidata contains the entities representing notable public domain works, the data is curated by various communities and individuals, so the quality of details on those entities varies widely. As a result, retrieving a good subset of objects (entities with Wikipedia pages) and tags (property:value pairs) for navigation in OpenDataDEx, the Open Data Explorer, is difficult. Generated graphs show heavy clustering, indicating that certain subjects are much more robust in Wikidata than others. Meanwhile, clusters you would expect do not exist because properties are missing. However, there is enough metadata to provide interesting, insightful navigation in OpenDataDEx, illuminating the potential of Open and Linked Data while revealing the areas where data curation is sorely needed.
Querying Wikidata for Public Domain Works Data
If you were to go to https://www.wikidata.org/wiki/Q45585, the Wikidata entity page for Van Gogh’s Starry Night, you’d see the property P6216, “copyright status”, with the value Q19652, “public domain”. This appears to be the primary tag in Wikidata for explicitly asserting that a work is in the public domain. Unfortunately, querying for “?item wdt:P6216 wd:Q19652” on its own currently causes the SPARQL endpoint to time out. Whether this is a limitation of Wikidata’s triplestore or the dataset is truly too large for retrieval, a smaller subset must be retrieved. A good way to do this is to add “?item wdt:P31 [type]” to the query, where P31 is Wikidata’s foundational “instance of” property and [type] is the entity for the type of creative work, such as Q7725634, “literary work”. This addition will likely prevent the timeout, but of course only the specified types of work will be included.
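As a rough sketch, the type-restricted query can be assembled and sent to the endpoint with Python’s standard library. The function names, the LIMIT clause, the label service line, and the User-Agent string are my own additions for illustration and are not part of the OpenDataDEx demo.

```python
import json
import re
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def copyright_status_query(type_qid: str, limit: int = 100) -> str:
    """Build a query for public domain works (P6216 = Q19652) of one type."""
    if not re.fullmatch(r"Q[1-9]\d*", type_qid):
        raise ValueError(f"not a Wikidata item ID: {type_qid!r}")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        # Restricting on "instance of" (P31) keeps the result set small.
        f"  ?item wdt:P31 wd:{type_qid} ;\n"
        "        wdt:P6216 wd:Q19652 .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def run_query(query: str, user_agent: str = "pd-works-example/0.1") -> list:
    """Send the query to the endpoint and return the JSON result bindings."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}", headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]
```

Calling run_query(copyright_status_query("Q7725634")) would request public domain literary works. Exact subclass matching is not attempted here, so items typed as, say, “novel” rather than “literary work” would be missed.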
To get a subset of public domain works that varies across instance types, using the work’s public domain date instead of its copyright status is a good option. This was the path chosen for the accompanying OpenDataDEx demo. A query for “?item wdt:P3893 ?pdd”, where P3893 is Wikidata’s “public domain date” property, combined with a filter for works whose date has passed, “FILTER(?pdd < NOW())”, returns quickly from the endpoint and provides works across domains. However, it excludes every public domain work that does NOT have a value for the “public domain date” property. The reasons for a work not having this date are many: the work was created before copyright law existed, its publication or authorship is unclear, copyright law has changed, or jurisdictions conflict internationally. And, of course, there might be a public domain date that just hasn’t been entered into Wikidata yet.
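A minimal sketch of the date-based query follows, together with a helper that tallies the returned works by instance type to show how varied the result set is. The OPTIONAL type lookup, the LIMIT, and both function names are assumptions added for illustration, and the helper expects bindings in the standard SPARQL JSON results shape.

```python
from collections import Counter
from typing import Iterable, Mapping


def public_domain_date_query(limit: int = 200) -> str:
    """Build a query for works whose public domain date (P3893) has passed."""
    return (
        "SELECT ?item ?type ?pdd WHERE {\n"
        "  ?item wdt:P3893 ?pdd .\n"
        "  FILTER(?pdd < NOW())\n"
        # The type is optional so that untyped works are not dropped.
        "  OPTIONAL { ?item wdt:P31 ?type . }\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def count_by_type(bindings: Iterable[Mapping[str, Mapping[str, str]]]) -> Counter:
    """Count result rows per instance type; rows without a type go under None."""
    counts: Counter = Counter()
    for row in bindings:
        type_uri = row.get("type", {}).get("value")
        counts[type_uri] += 1
    return counts
```

A lopsided tally from count_by_type is one quick way to spot the clustering described earlier, where some kinds of work are far better curated than others.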
Both the “public domain date” and “copyright status” properties in Wikidata allow for qualifiers on the values, such as P1001, “applies to jurisdiction”, and P459, “determination method or standard”. These statement qualifiers can be used to further narrow the subset of public domain works. For example, an option for pulling works that are in the public domain in Japan is to query for the qualifier P1001 (“applies to jurisdiction”) with value Q117408984, “countries with 75 years pma or shorter”, since Japan provides copyright protection for authors’ lifetime plus 70 years.
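Qualifiers are not reachable through the simple “wdt:” form used so far; the query has to walk through the full statement node using the “p:”, “ps:”, and “pq:” prefixes. The sketch below shows one way to write that for the “copyright status” property. The function name, the optional type restriction, and the LIMIT are my own additions, and the jurisdiction QID is simply the one quoted above.

```python
import re
from typing import Optional

_QID = re.compile(r"Q[1-9]\d*")


def jurisdiction_query(
    jurisdiction_qid: str, type_qid: Optional[str] = None, limit: int = 100
) -> str:
    """Build a query for works marked public domain in a given jurisdiction."""
    for qid in (jurisdiction_qid, type_qid):
        if qid is not None and not _QID.fullmatch(qid):
            raise ValueError(f"not a Wikidata item ID: {qid!r}")
    lines = ["SELECT ?item WHERE {"]
    if type_qid is not None:
        # Optional "instance of" restriction to keep the query small.
        lines.append(f"  ?item wdt:P31 wd:{type_qid} .")
    lines += [
        # p: reaches the statement node, ps: its value, pq: its qualifiers.
        "  ?item p:P6216 ?statement .",
        "  ?statement ps:P6216 wd:Q19652 ;",
        f"             pq:P1001 wd:{jurisdiction_qid} .",
        "}",
        f"LIMIT {int(limit)}",
    ]
    return "\n".join(lines)
```

For the Japan example, jurisdiction_query("Q117408984", type_qid="Q7725634") would ask for literary works whose public domain statement carries that jurisdiction qualifier.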
Querying on either “public domain date” or “copyright status” has its limitations. There is a third alternative, which is possible because copyright law is codified: if you know the law, you can infer a creative work’s public domain status. For example, most works have a publication date in Wikidata, so if that date is before 1928, you could infer that the work is part of the public domain in the US. (That cutoff reflects the 95-year US term as of 2023, and it advances by one year every January 1.)
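The inference can be sketched as a small calculation plus a query on Wikidata’s “publication date” property, P577. This assumes only the 95-year US term for published works and ignores every other rule (renewals, unpublished works, sound recordings, and so on), so treat it as a rough first pass; the function names are hypothetical.

```python
import datetime
from typing import Optional

US_TERM_YEARS = 95  # assumed term for older published works in the US


def us_cutoff_year(today: Optional[datetime.date] = None) -> int:
    """Works published before this year are inferred to be US public domain."""
    today = today or datetime.date.today()
    return today.year - US_TERM_YEARS


def inferred_us_public_domain(
    publication_year: int, today: Optional[datetime.date] = None
) -> bool:
    """Apply the cutoff to a single publication year."""
    return publication_year < us_cutoff_year(today)


def publication_date_query(type_qid: str, cutoff_year: int, limit: int = 100) -> str:
    """Build a query for works of one type published before the cutoff year."""
    return (
        "SELECT ?item ?published WHERE {\n"
        f"  ?item wdt:P31 wd:{type_qid} ;\n"
        "        wdt:P577 ?published .\n"
        f"  FILTER(YEAR(?published) < {int(cutoff_year)})\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )
```

Computing the cutoff from the current date, rather than hard-coding a year, keeps the inference from going stale as new works enter the public domain each January.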
Public Domain Works Data as a Knowledge Commons
I speculate that an AI future will draw heavily from public domain works. In 2025, Harvard Library released its corpus of the over one million public domain books it had digitized, providing an excellent learning ground for openly available AI models. Meanwhile, some creators are becoming increasingly hesitant to publish their copyrighted art on the internet, so learning models outside of corporate data warehouses may need to rely more heavily on content entering the public domain. The metadata on all these works is not secret or private, but there are currently huge gaps in public information and access. A lot of work must be done for public domain knowledge to actually be a useful digital resource for humans.
Fortunately, museums, the Library of Congress, and, of course, Wikidata have set up the structure, access, and incentive to follow through with this knowledge curation. Wikidata especially facilitates data entry by anyone who cares, for any reason, about the truthfulness of a particular public domain work’s details. My hope is that communities come together to curate this data, and that organizations commit to Semantic Web standards and Linked Data principles so that the knowledge is not only open but also rich, meaningful, and interoperable. By doing so, the benefits of the public domain will be greatly expanded, enabling works to be better compared, grouped, explored, and searched. May the future be one where a thriving “public domain knowledge commons” serves infinite use cases.
To see one of those use cases, watch the accompanying video, which demos the exploration of public domain works metadata in Wikidata using OpenDataDEx.