Exploring Open Data: Supreme Court Rulings in Wikidata

The Current State of Open Data on U.S. Supreme Court Rulings

The recording of law has predominantly been a “document” practice. Legal process largely involves submitting statements in filings, transcribing oral arguments, and writing up and summarizing judges’ decisions. Deconstructing the content of these documents into structured data, especially machine-readable structured data, has been and remains a slow, meticulous, multi-faceted process. While several European nations have made huge strides in bringing their legal data up to current Open, Linked, and Semantic standards, much of U.S. legal digital data is still only accessible in document formats, behind paywalls, through online dashboards, or in confusing data dumps. The Administrative Office of the U.S. Courts manages access to official records through PACER (Public Access to Court Electronic Records), a controversial system criticized as archaic and restrictive, so much so that activists have attempted to release its data to the public (see Carl Malamud and Aaron Swartz).

An exception is U.S. Supreme Court Rulings, which have been curated and published at a much higher level of quality and accessibility. Although the Supreme Court and Library of Congress websites primarily offer official case records as documents and audio recordings, many law schools and legal non-profits have turned that media into structured open data, a machine-readable public resource. The gold standard is Washington University Law’s Supreme Court Database (SCDB), an initiative of esteemed law and political science professor Harold Spaeth with the help of the National Science Foundation. This database is so respected that it has become a foundation for the study and critique of the Supreme Court by legal researchers, journalists, and academics. It is open to the public via an online analysis UI and data dumps in multiple formats. The data is not interoperable via Linked Data standards, but many other organizations import it into their own more structured data stores, which power better semantic legal research tools such as the independent Free Law Project’s CourtListener.

A huge benefit of the SCDB is that, because its curation is led by expert law professors, the content and structure of the data are very solid and can serve as an example of what all court case data should include. Per the official website, the 247 statements for each case (as of this writing) can be grouped into 6 categories:

identification info, such as citations and docket numbers
background info, such as how the Court took jurisdiction, origin and source of the case, the reason the Court agreed to decide it
chronological info, such as the date of decision, term of Court, natural court
substantive info, such as legal provisions, issues, direction of decision
outcome info, such as disposition of the case, winning party, formal alteration of precedent, declaration of unconstitutionality
voting and opinion info, such as how the individual justices voted, their opinions and interagreements

If Supreme Court Rulings knowledge is to serve as a true public resource, we should ensure it not only meets Open, Linked, and Semantic data standards, but that the knowledge covers all 6 categories above. I kept this in mind while exploring Supreme Court Rulings Open Data in Wikidata.

Exploring U.S. Supreme Court Rulings in Wikidata

Wikidata, the open knowledge base of the Wikimedia Foundation, appears to contain entities for almost all major Supreme Court rulings. I am not a legal expert, so there may be glaring omissions I don’t see, but to a casual user, there should be enough cases to provide a comprehensive exploration. I pulled the entities into the OpenDataDEx for heuristic structuring and visual navigation. Here are my observations on the data present from the 6 categories above:

Identification Data

Most of the Supreme Court cases with rulings in Wikidata do have the 3 statements using property P1031, “legal citation of this text”, that you would expect:

the official United States Reports citation
West Publishing’s Supreme Court Reporter citation
the LexisNexis Lawyers’ Edition citation

Most also have the citation from the Lexis database, one of the two most widely used commercial (proprietary) legal research databases. The other is the Westlaw database, but I do not see many cases with a Westlaw citation.
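To make the citation statements above a little more concrete, here is a minimal Python sketch that holds a SPARQL query for the P1031 values on each ruling and sorts citation strings by reporter. The query relies on the class Q19692072 discussed in the querying section of this article. The regular expressions and the name classify_citation are assumptions about how citation strings are typically written, not anything Wikidata defines, so they will likely need adjusting against real data.

```python
import re

# SPARQL for the Wikidata Query Service: every legal citation (P1031) stated on
# an instance of Q19692072 (decision of the Supreme Court of the United States).
CITATION_QUERY = """
SELECT ?ruling ?citation WHERE {
  ?ruling wdt:P31 wd:Q19692072 ;
          wdt:P1031 ?citation .
}
"""

# Assumed citation formats. Order matters: a Lexis citation such as
# "1954 U.S. LEXIS 2094" also contains "U.S.", so it is checked first.
REPORTER_PATTERNS = [
    ("Lexis database", re.compile(r"\bLEXIS\b", re.IGNORECASE)),
    ("Westlaw database", re.compile(r"\bWL\b")),
    ("Supreme Court Reporter", re.compile(r"\bS\.\s?Ct\.")),
    ("Lawyers' Edition", re.compile(r"\bL\.\s?Ed\.")),
    ("United States Reports", re.compile(r"\bU\.\s?S\.")),
]


def classify_citation(citation: str) -> str:
    """Return the reporter a citation string appears to belong to."""
    for reporter, pattern in REPORTER_PATTERNS:
        if pattern.search(citation):
            return reporter
    return "unknown"


if __name__ == "__main__":
    # Parallel citations for Brown v. Board of Education, used as sample input.
    for sample in ["347 U.S. 483", "74 S. Ct. 686", "98 L. Ed. 873"]:
        print(sample, "->", classify_citation(sample))
```

Grouping the classified citations per ruling would show at a glance which rulings are missing one of the three expected reporters.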

Background Data

Background knowledge of cases is lacking in Wikidata. In most cases, the best one can hope for is to have the defendant, property P1591, and plaintiff, property P1620, explicitly stated. Any information about the case from before it reached the Supreme Court is difficult to discover from the Wikidata statements alone.
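One way to put a number on how sparse a property is would be to count the rulings that carry at least one statement for it. The Python sketch below builds such a count query for any property ID. The function name coverage_query and the variable names are hypothetical, and the query assumes the standard wd: and wdt: prefixes that the Wikidata Query Service predefines.

```python
import re

RULING_CLASS = "Q19692072"  # decision of the Supreme Court of the United States


def coverage_query(property_id: str) -> str:
    """Build SPARQL that counts rulings with at least one value for property_id."""
    # Guard against typos such as "1591" or "defendant" before building the query.
    if not re.fullmatch(r"P[1-9]\d*", property_id):
        raise ValueError(f"not a Wikidata property ID: {property_id!r}")
    return (
        "SELECT (COUNT(DISTINCT ?ruling) AS ?rulings) WHERE {\n"
        f"  ?ruling wdt:P31 wd:{RULING_CLASS} ;\n"
        f"          wdt:{property_id} ?value .\n"
        "}"
    )


if __name__ == "__main__":
    # Defendant (P1591) and plaintiff (P1620), the two background properties
    # mentioned above.
    for pid in ("P1591", "P1620"):
        print(coverage_query(pid))
        print()
```

Comparing each count against the total number of rulings gives a rough coverage figure per property, which could be repeated for any of the six categories.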

Chronological Data

Almost every ruling in Wikidata has the initial publication date, property P577. If a significant event of the case is stated, such as an oral argument (Q7099379), it will almost always have a “point in time” (P585) qualifier. The dates still needed in Wikidata are the court terms, i.e., when the Supreme Court started and ended deliberation of the case.

Substantive Data

This is where I think Wikidata could benefit the most from input by legal experts. Most of the substance of cases is not captured, and as such, there are no natural clusterings in an app like OpenDataDEx around what the cases are about. Some cases, especially the more widely known ones, do have “main subject”, property P921, or “facet of”, P1269, stated, but these are very broad properties. People with a good grasp of legal semantics may know of better tags ([property:value]) to express the core issues of a case.

Some rulings in Wikidata do state the other cases referenced via the “cites work” property P2860. This can provide good semantic context if these referenced cases have more substantive details.

Outcome Data

Surprisingly, the outcomes of Supreme Court rulings are glaringly missing from Wikidata. Most rulings do not have any type of “winning party” property stated. Though the judges involved in the decision may have been entered, the actual content and direction of their votes likely were not. Perhaps this is more semantically complex to capture than I am imagining, but it seems this data specifically would be easy for anyone to find and import into Wikidata.

Voting & Opinion Data

Almost all Supreme Court ruling entities I saw in Wikidata do state the judge who provided the majority opinion (property P5826). The judges in the majority may even be listed under that statement via the “opinion joined by” qualifier property (P7122). In rare cases, other opinion data has been entered, such as a concurring or dissenting opinion via a “has part(s)” statement, with the judge of that opinion provided with the “author” qualifier.
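Because the joining judges sit on a qualifier rather than on the ruling itself, reading them requires going through the statement node. The sketch below shows one way this could look, assuming the p:, ps:, and pq: prefixes of the Wikidata Query Service (statement node, statement value, and qualifier value respectively). The helper group_majorities and the sample identifiers are hypothetical placeholders, not real Wikidata items.

```python
# SPARQL: majority opinion author (P5826) per ruling, plus any judges listed
# with the "opinion joined by" qualifier (P7122) on that same statement.
MAJORITY_QUERY = """
SELECT ?ruling ?author ?joiner WHERE {
  ?ruling wdt:P31 wd:Q19692072 ;
          p:P5826 ?statement .
  ?statement ps:P5826 ?author .
  OPTIONAL { ?statement pq:P7122 ?joiner . }
}
"""


def group_majorities(rows):
    """Collapse (ruling, author, joiner) rows into one entry per statement.

    Each joiner produces its own result row, so the rows are regrouped under
    the (ruling, author) pair. A joiner of None means the OPTIONAL qualifier
    was absent for that statement.
    """
    grouped = {}
    for ruling, author, joiner in rows:
        joiners = grouped.setdefault((ruling, author), [])
        if joiner is not None and joiner not in joiners:
            joiners.append(joiner)
    return grouped


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    sample_rows = [
        ("RULING_A", "JUDGE_1", "JUDGE_2"),
        ("RULING_A", "JUDGE_1", "JUDGE_3"),
        ("RULING_B", "JUDGE_4", None),
    ]
    for key, joiners in group_majorities(sample_rows).items():
        print(key, joiners)
```

Keying on the (ruling, author) pair keeps the grouping correct even if a ruling happens to carry more than one majority opinion statement.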

Querying Wikidata for U.S. Supreme Court Rulings Data

Querying for all Supreme Court rulings is fairly straightforward thanks to the existence of the entity Q19692072, “decision of the Supreme Court of the United States”. To get a comprehensive dataset of all rulings, you can simply query “?ruling wdt:P31 wd:Q19692072 .”, where P31 is Wikidata’s foundational “instance of” property. This query returns over 3,000 rulings, so retrieving all statements on every ruling is difficult because Wikidata’s SPARQL endpoint times out. One solution would be to piece together the result sets of multiple queries partitioned on a property that is ubiquitously present, such as publication date or majority opinion judge. Because the majority opinion judge is most likely recorded as a qualifier, publication date may be the best option.
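The partition-by-publication-date idea could be sketched roughly as follows in Python. The function names, the ten-year default window, and the shape of the merged result (a set of subject, predicate, object tuples) are all assumptions made for illustration. The queries assume the wd:, wdt:, and xsd: prefixes predefined by the Wikidata Query Service, and actually sending them to the endpoint is left out.

```python
def year_windows(first_year: int, last_year: int, step: int = 10):
    """Yield half-open (start, end) year pairs covering first_year..last_year."""
    if step <= 0:
        raise ValueError("step must be positive")
    start = first_year
    while start <= last_year:
        end = min(start + step, last_year + 1)
        yield start, end
        start = end


def window_query(start_year: int, end_year: int) -> str:
    """Build SPARQL for all statements on rulings published in [start, end)."""
    return f"""
SELECT ?ruling ?p ?o WHERE {{
  ?ruling wdt:P31 wd:Q19692072 ;
          wdt:P577 ?date .
  FILTER(?date >= "{start_year:04d}-01-01T00:00:00Z"^^xsd:dateTime &&
         ?date <  "{end_year:04d}-01-01T00:00:00Z"^^xsd:dateTime)
  ?ruling ?p ?o .
}}
""".strip()


def merge_batches(batches):
    """Union the triples from several result sets, dropping duplicates."""
    merged = set()
    for batch in batches:
        merged.update(batch)
    return merged


if __name__ == "__main__":
    # Sample range only; a full run would start at the earliest ruling.
    for start, end in year_windows(1950, 1974):
        print(f"-- rulings published {start} to {end - 1}")
        print(window_query(start, end))
```

Two caveats apply to this approach: a ruling with no publication date falls outside every window, and a ruling with several publication dates may appear in more than one window, which is why the merge step deduplicates.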

OpenDataDEx limits the total number of triples it processes to 20,000. With this limitation, the query for all statements on all rulings is usually successful (i.e., it does not time out at the Wikidata endpoint), but the result set is missing a lot of rulings. In the last DEx generation of this Supreme Court rulings query with the “LIMIT 20000”, we were able to navigate 1815 objects using 8568 unique tags. This means over half of the rulings in Wikidata are missing from the OpenDataDEx exploration, an unfortunate but necessary concession. Regardless, the DEx navigations are still interesting and provide real insight into the data.

U.S. Supreme Court Rulings data as a Knowledge Commons

The current state of Supreme Court rulings Open Data has been adequate for law practitioners, researchers, and journalists to analyze the court and its cases. However, as we enter a more digitized future, especially one involving agentic AI, the benefit of that data meeting Open, Linked, and Semantic Web standards will be immense. The U.S. can look to many European nations to see how their modernizing of structured legal data creates boundless opportunity for better analysis, practice, and access. These nations often have laws and case history far more semantically complex than in the U.S. because their legal systems are older and may contain deeper historical precedent, and yet the machine-readable data is still able to express truth and nuance.

Supreme Court rulings affect every citizen of the United States, not just legal professionals. As such, knowledge of them is inherently a public need and resource. Multiple communities of all sizes and types may be interested in curating this knowledge to be truthful, complete, and accessible. Due to the density and complexity of the language in these cases and rulings, people who understand U.S. law are especially needed to express the content of the six ontological categories we explored above. As these different communities, all with different incentives and motives, curate this data, a true Knowledge Commons is created.

For a demonstration of the potential of Supreme Court rulings Open Data, see the exploration of Wikidata entities using the OpenDataDEx in the video above.