
Disclaimer: This website is not the official project website but merely a development lab that I used and kept online because it seems useful for getting an overview and for quickly requesting some basic information. For citation, please use the corresponding paper or Zenodo resource as noted here. This website should also not be considered a reliable API, so please do not use the web requests in your analytical processes and instead work with the downloaded resources.



Video games are one of the most influential entertainment media in our world. They influence large parts of society directly as a means of entertainment, education and recreation, or indirectly, as seen in social gamification processes. A multi-billion-dollar market has developed around this medium, which in some cases is even under legal scrutiny because of harmful effects on parts of the population.

Yet video games are not yet a prominent subject of scientific research. One reason is that it is hard to say how an analysis should be carried out. Video games consist of bytecode that cannot be interpreted in any constructive way, and they are mainly consumed audiovisually, which is hard enough to analyse in itself. Another reason is one of video games' defining properties: interactivity. How can we analyse something whose actual content is in large part created and controlled by the person who consumes it?

The purpose of this project is to provide a solution for both of these problems and, hopefully, a robust starting point for empirical work in the field of Game Studies. Video game walkthroughs provide a textual representation of the video game in question and contain exactly the information that is needed to complete the game. These descriptions ignore the (theoretically infinite) variance of outcomes that results from the interaction element. Additionally, they convert the content of a video game into text, an information medium that is routinely analysed in many ways in various research environments.




Project Overview

Goal: The goal of this project is to publish a text corpus that compiles video game walkthroughs from various sources for textual analysis.
Project Coordinator: Dr. Jochen Tiepmar, Natural Language Processing Group, Leipzig University, https://orcid.org/0000-0002-3866-2830
Project Start: 12.02.2020
Project End: "When it's done"
Contact: jtiepmar(at)informatik.uni-leipzig.de
Bitbucket: https://bitbucket.org/jtiepmar/the-game-walkthrough-corpus/src
DOI: https://doi.org/10.5281/zenodo.4562336
Citation Data Set:
Tiepmar, J., and Burghardt, M., 2021. Game Walkthrough Corpus (GWTC) (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4562336

Citation Paper: Burghardt, M. and Tiepmar, J., 2021. The Game Walkthrough Corpus (GWTC) - A Resource for the Analysis of Textual Game Descriptions. Journal of Open Humanities Data, 7, p.14. DOI: http://doi.org/10.5334/johd.34
[OPEN ACCESS]

Copyright Information

Game walkthroughs are protected by individual copyright notices that are often very strict. That is why this data set does not include the documents themselves but instead various data formats that are useful for text mining and distant reading methods while not allowing the documents to be recreated. It is highly unlikely that even a single sentence can be reconstructed from the published data.
Since the documents are not published, not even in part, and only text mining statistics about them are made available, this project does not violate copyright. The data made available here are published under Creative Commons CC BY 3.0.

Links to the original documents are available in the data section.




Downloadable Content

Data

You can create subsets by providing a filter in the URL, as is done in the table. The filters work based on the CTS URNs. The filtered statistics use an em space (&emsp;) instead of [TAB] because HTML does not understand [TAB] and I don't know how to write .txt with Javascript similar to the raw data files.
You can, for example, request information only for German documents using the filter ".deu.", as is done here and visualized here. If no filter is provided, the statistical information for the whole data set is requested.
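
Since the website should not be used as an API, the same kind of subset can be built locally from a downloaded raw data file. The following is a minimal sketch that assumes the filter is a plain substring match on the first tab-separated column; the function name and the sample identifiers are hypothetical, and the real CTS URNs in the corpus may look different.

```python
import io


def filter_rows(lines, urn_filter=""):
    """Yield (key, value) pairs whose key contains urn_filter.

    An empty filter keeps every row, mirroring the behaviour
    described for requests without a filter.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split only at the first tab: key on the left, rest on the right.
        key, _, value = line.partition("\t")
        if urn_filter in key:
            yield key, value


# Hypothetical sample rows; identifiers in the real files may differ.
sample = io.StringIO(
    "urn:cts:gwtc:source.game1.deu.1\t1200\n"
    "urn:cts:gwtc:source.game1.eng.1\t3400\n"
    "urn:cts:gwtc:source.game2.deu.1\t560\n"
)

german = list(filter_rows(sample, ".deu."))
for key, value in german:
    print(key, value)
```

For a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the `io.StringIO` sample.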

If you are interested in specific statistics that are not covered, feel free to contact me.

To comply with copyright regulations, all data are randomized and provided in a way that makes it impossible to recreate the documents (or even a single sentence) while still being useful for analysis.
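
To give an idea of what such a randomization can look like, here is a small sketch. It is not the project's actual preprocessing code: the function names are hypothetical, and the assumption is simply that sentence order and token order are shuffled and that each sentence is then stored as a dictionary of token counts.

```python
import random
from collections import Counter


def scramble(sentences, seed=None):
    """Return a copy with shuffled sentence order and shuffled token order."""
    rng = random.Random(seed)
    shuffled = [list(tokens) for tokens in sentences]  # copy, keep input intact
    rng.shuffle(shuffled)       # sentence order is lost
    for tokens in shuffled:
        rng.shuffle(tokens)     # token order inside each sentence is lost
    return shuffled


def to_bags(sentences):
    """Represent every sentence as a token -> count dictionary."""
    return [dict(Counter(tokens)) for tokens in sentences]


doc = [
    ["open", "the", "door"],
    ["take", "the", "key"],
    ["use", "the", "key", "on", "the", "door"],
]

scrambled = scramble(doc, seed=0)
bags = to_bags(scrambled)
print(bags)
```

Co-occurrence counts within a sentence survive this step, which is what collocation analysis needs, while the original wording can no longer be read off the data.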

Corpus Statistics

Document Statistics | Download | Format | Visualisation
(URLs to) Texts | Raw Data; Usage of Filter | DocID [TAB] URL [TAB] Release Date | -
Text Length per Document (Characters) | Raw Data; Usage of Filter | Tab separated key-value pairs | Bar Chart; Bar Chart with Filter
Type / Token count | Types; Tokens; Type/Token | Tab separated key-value pairs | Bar Chart (Types); Bar Chart (Tokens); Bar Chart (Type/Token)
Bag of Words | Raw Data (211 MB); Usage of Filter | Tab separated key-value pairs with Python dictionaries as values. | -
nGrams | Zipped .txt (>1 GB) | Tab separated key-value pairs with Python dictionaries as values. | -
Sentence Collocations | Zipped .txt (>1.5 GB) | Sentence order and order of tokens per sentence are randomized. Tab separated key-value pairs with lists of sentences represented by Python dictionaries. | -
TF IDF | English (>230 MB); German (>20 MB) | Tab separated key-value pairs with Python dictionaries as values. | -
Walkthrough Documents per Game | Raw Data; Document count | Tab separated key-value pairs with comma separated values | Bar Chart; Bar Chart with filter (Metal Gear Solid)
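
Several of the formats above are described as tab separated key-value pairs with Python dictionaries as values. A line of that kind can be read safely with `ast.literal_eval` from the standard library. The sketch below assumes that each value is a dictionary literal mapping words to counts and that "Type/Token" means types divided by tokens; the sample line and the function names are made up for illustration.

```python
import ast


def parse_dict_line(line):
    """Split one 'key<TAB>{...}' line into (key, dict)."""
    key, _, raw = line.rstrip("\n").partition("\t")
    # literal_eval only accepts Python literals, so no code is executed.
    value = ast.literal_eval(raw)
    if not isinstance(value, dict):
        raise ValueError(f"expected a dictionary for key {key!r}")
    return key, value


def type_token_ratio(bag):
    """Number of distinct words divided by the total number of words."""
    tokens = sum(bag.values())
    return len(bag) / tokens if tokens else 0.0


# Hypothetical sample line; real keys and words come from the download.
line = "doc1\t{'the': 4, 'door': 1, 'key': 3}\n"

doc_id, bag = parse_dict_line(line)
ratio = type_token_ratio(bag)  # 3 types / 8 tokens
print(doc_id, ratio)
```

Because the larger files are several hundred megabytes in size, it is advisable to iterate over the file line by line rather than reading it into memory at once.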




Metadata

The metadata are compiled from Steam and RAWG, which means there is a serious PC bias, but console games are also included.

Metadata | Download | Format | Visualisation
Full List of Game Titles | Raw Data; HTML Table | Tab separated key-value pairs | -
Short Descriptions (RAWG) | Raw Data; HTML Table | Tab separated key-value pairs | -
Gameplay Tags | Raw Data; HTML Table | Tab separated key-value pairs with comma separated values | Histogram; Combined Histogram; Time Series "Point&Click"
Genres | Raw Data; HTML Table | Tab separated key-value pairs with comma separated values | Histogram; Combined Histogram; Time Series "Indie"
Publishers | Raw Data; HTML Table | Tab separated key-value pairs with comma separated values | Histogram; Combined Histogram; Time Series "Square Enix"
Developers | Raw Data; HTML Table | Tab separated key-value pairs with comma separated values | Histogram; Combined Histogram; Time Series "Ubisoft"
Supported Game Languages | Raw Data; HTML Table | Tab separated key-value pairs with comma separated values | Histogram; Combined Histogram
Supported Platforms (PC, Gameboy, iOS,...) | Raw Data; HTML Table | Tab separated key-value pairs with comma separated values | Histogram; Combined Histogram
Release Date | Raw Data; HTML Table | Tab separated key-value pairs with comma separated values (YYYY-MM-DD) | Time Series
Combined Metadata | Raw Data; HTML Table | Tab separated with column header (for nested formats see individual entries) | Coverage Overview
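
The metadata files with comma separated values can be turned into a histogram with a few lines of Python. This is only a sketch under the assumption that the key is the game title and that the values are separated by plain commas; the sample rows and function names are invented, and values that themselves contain a comma would be split incorrectly by this simple approach.

```python
from collections import Counter


def parse_multi_value(lines):
    """Read 'key<TAB>v1,v2,...' lines into a dict of lists."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, raw = line.partition("\t")
        # Drop empty entries so that a missing value yields an empty list.
        table[key] = [v.strip() for v in raw.split(",") if v.strip()]
    return table


def histogram(table):
    """Count how many games carry each value."""
    return Counter(v for values in table.values() for v in values)


# Hypothetical sample rows in the assumed layout.
sample = [
    "Game A\tIndie,Adventure\n",
    "Game B\tIndie\n",
    "Game C\tRPG,Adventure\n",
]

genres = parse_multi_value(sample)
counts = histogram(genres)
print(counts.most_common())
```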




Corpus size

Documents: 12295
Words: more than 140 million (types)
Combined Text Length: more than 940 million characters

Project Statistics

Game Language Associations: 4631
Walkthrough Languages: deu, eng
Walkthrough Sources: portforward, neoseeker, spieletipps, jayisgames, gamesetter
Number of Games: 6013
Genre Associations: 3806
Gameplay Tags: 10246
Release Dates: 2443
Developers: 3152
Publishers: 2782
Steam IDs: 1086
Platform Associations: 5293 (PC, Gameboy, iOS, Linux,...)




(Optional) Roadmap

Suggestions, hints and help are welcome


GWTC logo design by Mimikry