Exploring Open Data: Seattle Mariners Players in Wikidata
The Current State of Seattle Mariners Players Open Data
I am not a big sports buff, but I just may become one now that I see the wealth of sports data that has been, and continues to be, curated by fan communities in open and collaborative ways. What these fans call stats is just data, and their care of these stats is a testament to the idea that public data can meet high standards of accuracy and completeness. The sport with arguably the most committed data nerds is baseball. I grew up in the Puget Sound region of Washington, and going to the Kingdome (and then Safeco Field) to watch Ken Griffey Jr., Edgar Martinez, and Ichiro Suzuki play baseball for the Seattle Mariners is one of my core childhood memories. Although I didn’t find any open data sources specific to Seattle Mariners players, their participation in Major League Baseball (MLB) means any MLB players dataset should include them.
The gold standard of open professional baseball statistics is the Lahman Baseball Database, created by author and journalist Sean Lahman, who made the database open in 1995 and donated it to SABR (Society for American Baseball Research) in 2024. The database includes meticulously researched batting, pitching, and fielding statistics, standings, team info, managerial records, postseason data, and more, dating from 1871 (the dawn of professional baseball) to the present. The data has become foundational to baseball analysis and scholarship. The current database, under SABR, now also includes Negro Leagues stats curated by Seamheads.com (who also curate a Ballparks Database). This is a commendable addition, as MLB only recently recognized the Negro Leagues as major leagues. The database is openly available as downloads in .bak (SQL), .mdb (MS Access), and .csv (individual tables) formats.
The complement to the Lahman Baseball DB is Retrosheet.org’s open game data. A volunteer-run non-profit, Retrosheet has curated play-by-play accounts of all MLB games since 1910, and box scores for all games since 1898. The data is so comprehensive that it has been used to build simulations of historic games. Events and box scores from Negro Leagues games are also included. Data downloads are available as raw game “event” files (.evn, .eva) and as .csv files parsed from them, each with a different statistical focus.
Although both the Lahman and Retrosheet databases are well curated, the data has not been brought up to Linked and Semantic Web standards. It is not queryable unless you use online dashboards or load the data into your personal database software. Entities do not have semantic identifiers, so you must have a deep understanding of the data to be able to link tables to each other. This makes it difficult to retrieve all available data for players who have played for a specific team. A player will exist in multiple databases (e.g., award winners, team rosters, batting statistics), so getting all available data on them is not straightforward.
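To make the linking burden concrete, here is a minimal sketch of the manual join needed to list one team's players from the Lahman CSV tables. It assumes the CSV release has a People table and an Appearances table that share a playerID column, and that the Mariners' team code is SEA. The column names reflect my understanding of the release, and the rows are invented placeholders, not real records.

```python
import csv
import io

# Tiny stand-ins for People.csv and Appearances.csv. Column names follow the
# Lahman CSV release as I understand it; the rows are invented placeholders.
PEOPLE_CSV = """playerID,nameFirst,nameLast
examp01,Example,Player
sampl01,Sample,Hitter
"""

APPEARANCES_CSV = """yearID,teamID,playerID
1995,SEA,examp01
1996,SEA,examp01
1996,NYA,sampl01
"""


def players_for_team(people_csv, appearances_csv, team_id):
    """Return {playerID: (full name, sorted seasons)} for one team code."""
    # Build a name lookup keyed on the shared playerID column.
    names = {
        row["playerID"]: f'{row["nameFirst"]} {row["nameLast"]}'
        for row in csv.DictReader(io.StringIO(people_csv))
    }
    # Collect the seasons in which each player appeared for the team.
    seasons = {}
    for row in csv.DictReader(io.StringIO(appearances_csv)):
        if row["teamID"] == team_id:
            seasons.setdefault(row["playerID"], []).append(int(row["yearID"]))
    return {pid: (names.get(pid, pid), sorted(years)) for pid, years in seasons.items()}


mariners = players_for_team(PEOPLE_CSV, APPEARANCES_CSV, "SEA")
```

Nothing in the files says that playerID in one table refers to the same person as playerID in another, or that SEA means the Seattle Mariners. The person writing the join has to know both.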
The official MLB site does provide queryable and structured data, including player data, via their stats API. It is surprisingly “open”, but full access is limited and clear documentation is nearly impossible to find. Fan communities have learned to reverse-engineer the API and help each other use the API to retrieve the information they need. For users who don’t need direct access to the data, the MLB’s Baseball Savant site provides exploration of the data via dashboards and summary pages.
Clearly, a gap remains in the ability to directly retrieve Open Data on Seattle Mariners players, especially data that can connect and federate with multiple authoritative sources. Fortunately, the Seattle Mariners players data in Wikidata, Wikimedia’s open knowledge base, is robust enough to serve as a starting point for building that ability.
Exploring Seattle Mariners Players in Wikidata
People have done a very good job of creating entities in Wikidata for almost every major baseball player who has played for the Mariners. Enough details have been added for each player that navigating through them in a data explorer like OpenDataDEx provides meaning and insight. Very nice clusters emerge around the properties for position played (P413), award won (P166), and other teams the player has played for (P54). It is interesting to see relationships between players who played during completely different eras of the Mariners.
One semantic issue that appears is that an MLB player may have played for the Mariners at one point in their career but be much better known for playing with another team. For example, infielder and designated hitter Justin Turner played for the Mariners for only half a season and is much better known for his years with the Dodgers, but he still shows up in the same exploration as Seattle heroes like Ken Griffey Jr. and Edgar Martinez. One could infer the significance of a team to a player’s career from how long the player was on the roster or what achievements he earned while playing for it, but that guess may not always reflect reality. One clear indicator of a player’s importance to the Mariners is their induction into the Seattle Mariners Hall of Fame (Wikidata entity Q7442130). Fans may also debate which team, if any, a player represents: Alex Rodriguez, for example, became a star with the Mariners but cemented his career and won a World Series with the Yankees.
The talent level of a player in general is also not directly recorded in Wikidata. Although one could infer this from awards received or participation in championship events, none of the fine-grained statistics curated by fan communities are in Wikidata to support skill comparison. This may not be necessary, though, if open stats databases could be semantically linked from Wikidata. In fact, for most MLB player entities, the Retrosheet and official MLB IDs have already been entered. Work still needs to be done to comprehensively record the Lahman Database ID, but there may already be mappings to this ID from the Retrosheet or MLB IDs. If a player’s metadata in a general knowledge base like Wikidata could be fully linked to their specific game performance in expert baseball statistics, a user could retrieve a very full picture of Seattle Mariners players.
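As a rough illustration of how such a mapping could work, the sketch below builds a crosswalk from Retrosheet IDs to Lahman player IDs. It assumes the Lahman People table carries a retroID column next to its own playerID, which is my understanding of the CSV release. The IDs, the placeholder QIDs, and the function names are all invented for the example.

```python
import csv
import io

# Stand-in for Lahman's People.csv, assumed to carry playerID and retroID
# columns. The rows are invented placeholders, not real player records.
PEOPLE_CSV = """playerID,retroID
examp01,exame001
sampl01,samps001
nolink01,
"""


def build_crosswalk(people_csv):
    """Map Retrosheet IDs to Lahman playerIDs, skipping rows with no retroID."""
    return {
        row["retroID"]: row["playerID"]
        for row in csv.DictReader(io.StringIO(people_csv))
        if row["retroID"]
    }


def link_wikidata_players(wikidata_players, crosswalk):
    """Attach a Lahman ID to each (QID, Retrosheet ID) pair; None if unmapped."""
    return {qid: crosswalk.get(retro_id) for qid, retro_id in wikidata_players}


# Hypothetical (QID, Retrosheet ID) pairs as they might come back from Wikidata.
wikidata_players = [("Q_EXAMPLE_1", "exame001"), ("Q_EXAMPLE_2", "unknown01")]
linked = link_wikidata_players(wikidata_players, build_crosswalk(PEOPLE_CSV))
```

Players that come back as None are the ones whose linkage would still need to be curated by hand.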
Querying Wikidata for Seattle Mariners Players
Wikidata supports two tags (property:value pairs) that should make it straightforward to retrieve all baseball players that have played for the Mariners: [“member of sports team” : “Seattle Mariners”] ([P54:Q466586]) and [“occupation” : “baseball player”] ([P106:Q10871364]). Querying on these two tags does provide a comprehensive list of people who ever played for the Mariners. However, it also reveals a problematic ambiguity in the case of MLB players that have gone on to become staff, such as a coach or manager, for different teams. An entity who was a baseball player, but never for the Mariners, and then joined the Mariners as staff after retirement, could show up in results using those two tags.
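The two tags translate fairly directly into SPARQL. The following is a minimal sketch, not necessarily the exact query used for this exploration. It assumes the public Wikidata Query Service endpoint and uses the "truthy" wdt: prefix for both properties. The function names and the User-Agent string are my own.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_players_query(team_qid="Q466586", occupation_qid="Q10871364"):
    """Build a SPARQL query for people with both the team and occupation tags."""
    return f"""
SELECT ?player ?playerLabel WHERE {{
  ?player wdt:P54 wd:{team_qid} ;
          wdt:P106 wd:{occupation_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


def run_query(query):
    """Send the query to the endpoint and return the raw result bindings."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "mariners-wikidata-example/0.1 (illustrative script)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


query = build_players_query()
# Uncomment to hit the live service (needs network access):
# for row in run_query(query):
#     print(row["playerLabel"]["value"])
```

One design caveat: wdt: matches only the best-ranked statements for a property. If an entity's current team has been given preferred rank, an older Mariners statement would be skipped, and a query over the full statement nodes (p:P54 / ps:P54) would be needed to catch it.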
Wikidata’s qualifier structure could help resolve this ambiguity. For the tag [“member of sports team” : “Seattle Mariners”], you could add P106, “occupation”, as a qualifier. Or for the tag [“occupation” : “baseball player”], you could add the qualifier P54, “member of sports team”. Or, if both tags had qualifiers stating the dates they were true, you could infer that the person was actively a player with the Mariners by comparing those dates. The better fix, though, would be to not use P54 “member of sports team” at all, and instead use a property that distinguishes between “played for Mariners” and “was staff for Mariners”. For this Open Data Exploration, I chose to accept the possible ambiguity because it is fairly rare: most players are less noted for their post-playing career work, so that information is often missing from Wikidata. Also, many players go on to become staff for teams they had played for.
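The date-comparison idea could look something like the sketch below. It assumes the team statements carry the standard start time (P580) and end time (P582) qualifiers, which will not be true for every entity. The overlap helper and its treatment of missing dates as open-ended are my own choices, not something Wikidata prescribes.

```python
def build_stint_query(team_qid="Q466586"):
    """Build a SPARQL query that reads start/end qualifiers off each team statement."""
    return f"""
SELECT ?player ?start ?end WHERE {{
  ?player p:P54 ?stint .
  ?stint ps:P54 wd:{team_qid} .
  OPTIONAL {{ ?stint pq:P580 ?start . }}
  OPTIONAL {{ ?stint pq:P582 ?end . }}
}}
""".strip()


def intervals_overlap(a, b):
    """Return True if two (start_year, end_year) ranges overlap.

    A None bound is treated as unknown or open-ended, so it never rules out
    an overlap on its own.
    """
    a_start, a_end = a
    b_start, b_end = b
    a_starts_before_b_ends = a_start is None or b_end is None or a_start <= b_end
    b_starts_before_a_ends = b_start is None or a_end is None or b_start <= a_end
    return a_starts_before_b_ends and b_starts_before_a_ends


stint_query = build_stint_query()

# Illustrative year ranges only: (time with the team, playing career).
played_for_team = intervals_overlap((1989, 1999), (1987, 2010))
joined_as_staff = intervals_overlap((2015, 2018), (1990, 2005))
```

Here the second case comes out as no overlap, which is the signal that the person was with the team only after their playing days. When dates are missing, the helper errs on the side of keeping the person in the results.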
Another issue that affects query results is that a very small number of Mariners (23 entities) instead have the occupation tag [“occupation” : “professional baseball player”], [P106:Q11336312]. This causes them to be left out of the main query entirely, which returns well over 800 players. Surprisingly, Alex Rodriguez has this alternative occupation value, so he does not appear in the main query results. My recommendation is that Q11336312, “professional baseball player”, should not be used as a value for occupation. First, it is not the widely used value, and second, it is a bit redundant. A person whose primary job is “baseball player” can be considered a professional.
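Until the data is cleaned up, a query can work around the split by accepting either occupation value. This is a minimal sketch using a SPARQL VALUES clause. The function name is mine, and the query is one possible workaround rather than the one used for this exploration.

```python
# Both occupation values seen on Mariners player entities.
OCCUPATIONS = ("Q10871364", "Q11336312")


def build_inclusive_query(team_qid="Q466586", occupation_qids=OCCUPATIONS):
    """Build a SPARQL query that accepts any of the listed occupation values."""
    values = " ".join(f"wd:{qid}" for qid in occupation_qids)
    return f"""
SELECT DISTINCT ?player ?playerLabel WHERE {{
  VALUES ?occupation {{ {values} }}
  ?player wdt:P54 wd:{team_qid} ;
          wdt:P106 ?occupation .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


inclusive_query = build_inclusive_query()
```

DISTINCT keeps an entity that happens to carry both occupation values from being listed twice.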
Seattle Mariners Players Data as a Knowledge Commons
Baseball is considered one of America’s oldest pastimes. Although its popularity has waxed and waned, the investment by fans into player information and statistics has endured. With the rise of fantasy baseball and online sports betting, I don’t expect the need for accurate and comprehensive player data to go anywhere. In fact, it seems to be getting more sophisticated, with technology enabling new types of stats and analysis to be observed and recorded. Computers have always been embraced by baseball archivists, with open digitization and sharing of these statistics beginning as early as the 1980s. Baseball fans, including Seattle Mariners fans, have truly created a knowledge environment where community-driven data curation thrives.
The next step is to bring that knowledge up to Open, Linked, and Semantic Web standards so that the data can truly serve as a Knowledge Commons. If Wikidata, the Lahman Database, the Retrosheet Database, the MLB API, and other sources could all be considered together in the facts and ontologies they offer, the future of baseball knowledge, which will be parsed by machines and AI, will be rich indeed.
For a demonstration of the potential of Open Data, see the above video exploring the Seattle Mariners players in Wikidata using the OpenDataDex.