For my tutorial project for Introduction to Data Science, I plan to investigate the most-played games on the video game digital distribution service Steam, determining which games are the most played on the platform both recently and of all time, and examining their respective play trends. The data will be drawn from the statistics page of Steam, Steam’s developer API, the data catalog data.world, and the third-party tool SteamDB.
From my initial research, it seems that PLAYERUNKNOWN’S BATTLEGROUNDS (PUBG), Defense of the Ancients 2 (Dota 2), Counter Strike: Global Offensive (CS:GO), Grand Theft Auto V (GTA V), and Destiny 2 are currently the most-played games. This finding will be revisited as the project progresses. By analyzing each game's player count over a 24-hour play cycle and since its release date, ideal times for announcing sales, maintaining servers, and releasing downloadable content (DLC) can be identified.
There is a vast amount of data recording the change in each game's player count since its release date, which makes this data plentiful as well as reasonably reliable. Even if some days lack player-count records, the trend before and after the gap can help estimate the missing values. Because multiple websites host player-count records for these games, the sources can be cross-checked against one another, which reduces the risk of bias from any single source. The aforementioned resources archive and publish legacy statistics and are updated regularly, so the data is current and the trends and observations drawn from it remain relevant.
I plan to work on this project for this class without a partner; however, it should be noted that the effort I put into data collection and refinement also goes into my computer science capstone project, because I will use a similar dataset for both projects. I had initially intended to take these datasets through the data pipeline for the capstone project, but I extended the idea to make it applicable to this course. Although I have two partners for the capstone project, I will be the primary data fetcher and manipulator. This dataset was chosen because it is plentiful and robust. It tracks the player count over time for a particular game, which is useful in determining trends that inform future plans of action. Because the records are largely complete (few, if any, data points are missing), these datasets can effectively be used to form observations and identify trends.
As the data is accumulated, many questions can be posed. For example: "When was the maximum/minimum continuous player count for this particular game?", "What caused these extrema?", and "Do these relative extrema follow a trend, and can this trend be capitalized upon?" Keeping such questions in mind makes the dataset more manageable, for there is a particular goal amidst the vast amount of information. With respect to my datasets, I aspire to compare several individual games' statistics with the averages of the aggregated data to see whether shared trends hold up under observation.
I have already begun accessing the sources which contain the match statistics. I plan to extract, transform, and load the data so that it is in a more manageable format, ready to be analyzed. I hope to manipulate the statistics so that player trends and related observations can be put to more effective use. I also hope to observe subtle nuances in the gameplay so that an ideal plan for releasing updates and downloadable content can be formulated.
The datasets below contain four variables: DateTime, Twitch Viewers, Player Trends, and Players. This information is crucial because it ties the number of players and viewers to a particular time, both since each game's release date and within the past week, while averaging recent values to capture a central figure that represents the current trend in player counts.
To keep our notebook organized, let us keep all our imported libraries at the beginning.
# Import libraries
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
Also, let us read in the CSV files found on SteamDB containing the player count, Twitch viewers, and player trends for PLAYERUNKNOWN’S BATTLEGROUNDS (PUBG), Defense of the Ancients 2 (Dota 2), Counter Strike: Global Offensive (CS:GO), Grand Theft Auto V (GTA V), and Destiny 2 since their respective release dates and within the past week.
# Read in datasets from SteamDB
CSGO_week = pd.read_csv('./CSGO_week.csv')
CSGO_all = pd.read_csv('./CSGO_all.csv')
Destiny_week = pd.read_csv('./Destiny_week.csv')
Destiny_all = pd.read_csv('./Destiny_all.csv')
DOTA_week = pd.read_csv('./Dota_week.csv')
DOTA_all = pd.read_csv('./Dota_all.csv')
GTA_week = pd.read_csv('./GTA_week.csv')
GTA_all = pd.read_csv('./GTA_all.csv')
PUBG_week = pd.read_csv('./PUBG_week.csv')
PUBG_all = pd.read_csv('./PUBG_all.csv')
Let's check that the data was loaded properly by displaying the head of some of the datasets. The other datasets can be viewed using the same method.
CSGO_week.head()
CSGO_all.head()
Destiny_week.head()
Destiny_all.head()
Thankfully, the data is fairly clean. If an entry is missing any variable, that entry will be dropped. Because the data is plentiful, the surrounding records preserve the overall trend, which can be used to estimate the missing entry.
All variables need to have the proper dtype. The DateTime variable needs to be a datetime object, whereas the Twitch Viewers, Player Trends, and Players variables need to be integers or floats because they are numeric.
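As a minimal sketch of how a missing entry could be estimated from the surrounding trend, the cell below applies time-based linear interpolation to a small frame. The values are made up for illustration and are not SteamDB data, and interpolation is only one possible choice; the notebook itself does not specify a method.

```python
import numpy as np
import pandas as pd

# Made-up hourly player counts with one missing entry (illustration only).
demo = pd.DataFrame({
    'DateTime': pd.to_datetime([
        '2020-01-01 00:00', '2020-01-01 01:00', '2020-01-01 02:00',
        '2020-01-01 03:00', '2020-01-01 04:00',
    ]),
    'Players': [100.0, 110.0, np.nan, 130.0, 140.0],
})

# Index by time so the gap is filled in proportion to elapsed time.
players = demo.set_index('DateTime')['Players']
filled = players.interpolate(method='time')

# Alternative: drop incomplete entries instead of estimating them.
dropped = demo.dropna(subset=['Players'])
```

Interpolating keeps the time axis evenly spaced, which can matter for later plots, whereas dropping rows is simpler but leaves a gap.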
# Determine that dtypes were inferred properly
CSGO_week.dtypes
It seems that the DateTime variable in the dataset for the past week's CS:GO player statistics was not inferred properly. Because the other datasets were retrieved from the same source and have the same variables, the DateTime variable in every dataset needs to be converted to a datetime dtype. This ensures that any manipulation involving DateTime is handled properly.
# Convert the DateTime variable in all dataframes to a datetime dtype
CSGO_week['DateTime'] = pd.to_datetime(CSGO_week['DateTime'])
CSGO_all['DateTime'] = pd.to_datetime(CSGO_all['DateTime'])
Destiny_week['DateTime'] = pd.to_datetime(Destiny_week['DateTime'])
Destiny_all['DateTime'] = pd.to_datetime(Destiny_all['DateTime'])
DOTA_week['DateTime'] = pd.to_datetime(DOTA_week['DateTime'])
DOTA_all['DateTime'] = pd.to_datetime(DOTA_all['DateTime'])
GTA_week['DateTime'] = pd.to_datetime(GTA_week['DateTime'])
GTA_all['DateTime'] = pd.to_datetime(GTA_all['DateTime'])
PUBG_week['DateTime'] = pd.to_datetime(PUBG_week['DateTime'])
PUBG_all['DateTime'] = pd.to_datetime(PUBG_all['DateTime'])
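The same conversion can also be written as a loop, which avoids repeating one line per dataframe. The sketch below uses two small stand-in frames with made-up values; in the notebook, the dictionary would hold the ten frames loaded earlier (the name `frames` is an assumption for illustration).

```python
import pandas as pd

# Stand-in frames with made-up values; in the notebook these would be
# CSGO_week, CSGO_all, Destiny_week, and so on.
frames = {
    'CSGO_week': pd.DataFrame({
        'DateTime': ['2020-01-01 00:00:00', '2020-01-01 01:00:00'],
        'Players': [500, 480],
    }),
    'CSGO_all': pd.DataFrame({
        'DateTime': ['2019-01-01 00:00:00', '2019-01-02 00:00:00'],
        'Players': [450, 470],
    }),
}

# Convert the DateTime column of every frame in place.
for name, frame in frames.items():
    frame['DateTime'] = pd.to_datetime(frame['DateTime'])
```

Alternatively, `pd.read_csv` accepts a `parse_dates=['DateTime']` argument, which performs the conversion at load time.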
The loaded data must be visualized to get a better understanding of noticeable trends and specific extrema. Displaying the player counts and Twitch viewers, both for the past week and since each game originally came out, reveals further information about the datasets.
The information on Twitch viewers is not crucial to the exploration of our datasets; its existence merely supports any claims about positive or negative interest in the game.
Below are visualizations of the data from the datasets.
# CS:GO past week player graph
ax = plt.gca()
CSGO_week.plot(kind='line',x='DateTime',y='Players', ax=ax, figsize = (20, 10), fontsize = 20)
CSGO_week.plot(kind='line',x='DateTime',y='Twitch Viewers', color='purple', ax=ax, figsize = (20, 10), fontsize = 20)
plt.title('"Counter Strike: Global Offensive" Past Week Players', fontsize = 20)
plt.legend(fontsize = 20)
plt.xlabel('Date', fontsize = 20)
plt.ylabel('Number of People', fontsize = 20)
plt.show()
# CS:GO all time player graph
ax = plt.gca()
CSGO_all.plot(kind='line',x='DateTime',y='Players', ax=ax, figsize = (40, 20), fontsize = 40)
CSGO_all.plot(kind='line',x='DateTime',y='Twitch Viewers', color='purple', ax=ax, figsize = (40, 20), fontsize = 40)
plt.title('"Counter Strike: Global Offensive" All Time Players', fontsize = 40)
plt.legend(fontsize = 40)
plt.xlabel('Date', fontsize = 40)
plt.ylabel('Number of People', fontsize = 40)
plt.show()
# Destiny 2 past week player graph
ax = plt.gca()
Destiny_week.plot(kind='line',x='DateTime',y='Players', ax=ax, figsize = (20, 10), fontsize = 20)
Destiny_week.plot(kind='line',x='DateTime',y='Twitch Viewers', color='purple', ax=ax, figsize = (20, 10), fontsize = 20)
plt.title('"Destiny 2" Past Week Players', fontsize = 20)
plt.legend(fontsize = 20)
plt.xlabel('Date', fontsize = 20)
plt.ylabel('Number of People', fontsize = 20)
plt.show()
# Destiny 2 all time player graph
ax = plt.gca()
Destiny_all.plot(kind='line',x='DateTime',y='Players', ax=ax, figsize = (40, 20), fontsize = 40)
Destiny_all.plot(kind='line',x='DateTime',y='Twitch Viewers', color='purple', ax=ax, figsize = (40, 20), fontsize = 40)
plt.title('"Destiny 2" All Time Players', fontsize = 40)
plt.legend(fontsize = 40)
plt.xlabel('Date', fontsize = 40)
plt.ylabel('Number of People', fontsize = 40)
plt.show()
# Dota 2 past week player graph
ax = plt.gca()
DOTA_week.plot(kind='line',x='DateTime',y='Players', ax=ax, figsize = (20, 10), fontsize = 20)
DOTA_week.plot(kind='line',x='DateTime',y='Twitch Viewers', color='purple', ax=ax, figsize = (20, 10), fontsize = 20)
plt.title('"Defense of the Ancients 2" Past Week Players', fontsize = 20)
plt.legend(fontsize = 20)
plt.xlabel('Date', fontsize = 20)
plt.ylabel('Number of People', fontsize = 20)
plt.show()
# Dota 2 all time player graph
ax = plt.gca()
DOTA_all.plot(kind='line',x='DateTime',y='Players', ax=ax, figsize = (40, 20), fontsize = 40)
DOTA_all.plot(kind='line',x='DateTime',y='Twitch Viewers', color='purple', ax=ax, figsize = (40, 20), fontsize = 40)
plt.title('"Defense of the Ancients 2" All Time Players', fontsize = 40)
plt.legend(fontsize = 40)
plt.xlabel('Date', fontsize = 40)
plt.ylabel('Number of People', fontsize = 40)
plt.show()
# GTA V past week player graph
ax = plt.gca()
GTA_week.plot(kind='line',x='DateTime',y='Players', ax=ax, figsize = (20, 10), fontsize = 20)
GTA_week.plot(kind='line',x='DateTime',y='Twitch Viewers', color='purple', ax=ax, figsize = (20, 10), fontsize = 20)
plt.title('"Grand Theft Auto V" Past Week Players', fontsize = 20)
plt.legend(fontsize = 20)
plt.xlabel('Date', fontsize = 20)
plt.ylabel('Number of People', fontsize = 20)
plt.show()
# GTA V all time player graph
ax = plt.gca()
GTA_all.plot(kind='line',x='DateTime',y='Players', ax=ax, figsize = (40, 20), fontsize = 40)
GTA_all.plot(kind='line',x='DateTime',y='Twitch Viewers', color='purple', ax=ax, figsize = (40, 20), fontsize = 40)
plt.title('"Grand Theft Auto V" All Time Players', fontsize = 40)
plt.legend(fontsize = 40)
plt.xlabel('Date', fontsize = 40)
plt.ylabel('Number of People', fontsize = 40)
plt.show()
# PUBG all time player graph
ax = plt.gca()
PUBG_all.plot(kind='line',x='DateTime',y='Players', ax=ax, figsize = (40, 20), fontsize = 40)
PUBG_all.plot(kind='line',x='DateTime',y='Twitch Viewers', color='purple', ax=ax, figsize = (40, 20), fontsize = 40)
plt.title('"PLAYERUNKNOWN’S BATTLEGROUNDS" All Time Players', fontsize = 40)
plt.legend(fontsize = 40)
plt.xlabel('Date', fontsize = 40)
plt.ylabel('Number of People', fontsize = 40)
plt.show()
# PUBG past week player graph
ax = plt.gca()
PUBG_week.plot(kind='line',x='DateTime',y='Players', ax=ax, figsize = (20, 10), fontsize = 20)
PUBG_week.plot(kind='line',x='DateTime',y='Twitch Viewers', color='purple', ax=ax, figsize = (20, 10), fontsize = 20)
plt.title('"PLAYERUNKNOWN’S BATTLEGROUNDS" Past Week Players', fontsize = 20)
plt.legend(fontsize = 20)
plt.xlabel('Date', fontsize = 20)
plt.ylabel('Number of People', fontsize = 20)
plt.show()
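The ten plotting cells above share one pattern, so they could be condensed into a helper function. The following is a sketch under that assumption; `plot_players` is a hypothetical name, and the demo frame holds made-up values rather than SteamDB data.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_players(df, title, figsize=(20, 10), fontsize=20):
    """Plot Players and Twitch Viewers against DateTime on one set of axes."""
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot(df['DateTime'], df['Players'], label='Players')
    ax.plot(df['DateTime'], df['Twitch Viewers'], color='purple',
            label='Twitch Viewers')
    ax.set_title(title, fontsize=fontsize)
    ax.set_xlabel('Date', fontsize=fontsize)
    ax.set_ylabel('Number of People', fontsize=fontsize)
    ax.tick_params(labelsize=fontsize)
    ax.legend(fontsize=fontsize)
    return ax


# Made-up values, only to show the call.
demo = pd.DataFrame({
    'DateTime': pd.to_datetime(['2020-01-01', '2020-01-02',
                                '2020-01-03', '2020-01-04']),
    'Players': [100, 120, 90, 110],
    'Twitch Viewers': [10, 15, 8, 12],
})
ax = plot_players(demo, '"Demo Game" Past Week Players')
```

In the notebook, a call such as `plot_players(CSGO_all, '"Counter Strike: Global Offensive" All Time Players', figsize=(40, 20), fontsize=40)` followed by `plt.show()` would stand in for one of the cells above.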
The local maximums and minimums since each game's release date are very telling. For example, in June of 2019 there is a large spike in Twitch viewers for Grand Theft Auto V that correlates with a much-anticipated update to the game. The effect of this update is also reflected in the spike in player counts around that time. Looking at local trends, the most extreme local minimum for Dota 2, in early 2015, can be attributed to a widespread server outage, because a local maximum immediately follows the event. All games also see a general increase in players during major sales, such as the Steam Summer Sale in June and July, as well as during large gaming tournaments, such as Dota 2's The International competition in August of each year.
Additionally, all of the weekly records for the top five games show an evident cyclic pattern over any given 24-hour period, with recurring local minimums and maximums. This information is imperative, for it allows game developers to understand when it is most ideal to roll out game updates or server maintenance so that the disruption affects the fewest people. By paying attention to such trends, developers can keep the vast majority of players satisfied rather than frustrated by an interrupted experience.
Notice that the scale of the y-axis varies from graph to graph. To get a better understanding of how one game's data relates to another's, the data must be combined and then displayed on the same axes.
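Extrema like these can also be located programmatically rather than read off the graphs. The sketch below, on made-up daily values, finds the overall peak and trough with `idxmax` and `idxmin`, and flags local maximums as points equal to the maximum of a centered three-point window. The window size is an arbitrary assumption and would need tuning on real data.

```python
import pandas as pd

# Made-up daily player counts (illustration only).
demo = pd.DataFrame({
    'DateTime': pd.to_datetime([
        '2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
        '2020-01-05', '2020-01-06', '2020-01-07',
    ]),
    'Players': [100, 150, 120, 90, 130, 170, 140],
})
series = demo.set_index('DateTime')['Players']

# Timestamps of the overall highest and lowest player counts.
peak_time = series.idxmax()
trough_time = series.idxmin()

# A point counts as a local maximum if it equals the largest value
# in the window made of itself and its two neighbours.
window_max = series.rolling(3, center=True).max()
local_maxima = series[series == window_max]
```

The first and last points have no complete window, so they are never flagged by this approach.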
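One way to turn the daily cycle into a concrete maintenance window is to average the player count by hour of day and pick the quietest hour. The sketch below uses a synthetic week of hourly counts with an artificial daily cycle, not SteamDB data. It also assumes the timestamps are all in one time zone, which would need checking against the real files.

```python
import numpy as np
import pandas as pd

# Synthetic week of hourly counts peaking at 18:00 each day (illustration only).
times = pd.Timestamp('2020-01-01') + pd.to_timedelta(np.arange(7 * 24), unit='h')
hours = times.hour.to_numpy()
players = 1000 + 300 * np.cos((hours - 18) / 24 * 2 * np.pi)
demo = pd.DataFrame({'DateTime': times, 'Players': players})

# Average player count for each hour of the day across the week.
hour_of_day = demo['DateTime'].dt.hour.rename('hour')
hourly_mean = demo.groupby(hour_of_day)['Players'].mean()

# Hours with the lowest and highest average player counts.
quietest_hour = hourly_mean.idxmin()
busiest_hour = hourly_mean.idxmax()
```

Applied to a frame such as `CSGO_week`, `quietest_hour` would suggest when an update or maintenance interrupts the fewest players, subject to the time-zone caveat above.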
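The notebook loads a pre-combined file for this step. For reference, a combined table of this shape could also be built from per-game frames, as sketched below with two made-up stand-in frames; the names `per_game` and `combined` are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for two per-game frames (illustration only).
csgo = pd.DataFrame({
    'DateTime': pd.to_datetime(['2020-01-01', '2020-01-02']),
    'Players': [500, 520],
})
dota = pd.DataFrame({
    'DateTime': pd.to_datetime(['2020-01-01', '2020-01-02']),
    'Players': [400, 390],
})
per_game = {'Counter-Strike: Global Offensive': csgo, 'Dota 2': dota}

# Align each game's Players column on DateTime, one column per game.
combined = pd.concat(
    {name: frame.set_index('DateTime')['Players']
     for name, frame in per_game.items()},
    axis=1,
).rename_axis('DateTime').reset_index()
```

Timestamps that exist for one game but not another would appear as missing values in the other game's column.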
# Read in the dataframe from SteamDB
All_all = pd.read_csv('./All_all.csv')
All_all.head()
# Ensure all variables are the proper dtype
All_all['DateTime'] = pd.to_datetime(All_all['DateTime'])
All_all.dtypes
# All games all time player graph
ax = plt.gca()
All_all.plot(kind='line',x='DateTime',y='Dota 2', color='red', ax=ax, figsize = (20, 10), fontsize = 20)
All_all.plot(kind='line',x='DateTime',y='Counter-Strike: Global Offensive', color='orange', ax=ax, figsize = (20, 10), fontsize = 20)
All_all.plot(kind='line',x='DateTime',y='Grand Theft Auto V', color='green', ax=ax, figsize = (20, 10), fontsize = 20)
All_all.plot(kind='line',x='DateTime',y='Destiny 2', color='blue', ax=ax, figsize = (20, 10), fontsize = 20)
All_all.plot(kind='line',x='DateTime',y="PLAYERUNKNOWN'S BATTLEGROUNDS", color='yellow', ax=ax, figsize = (20, 10), fontsize = 20)
plt.title('Top Five Games All Time Players', fontsize = 20)
plt.legend(fontsize = 20)
plt.xlabel('Date', fontsize = 20)
plt.ylabel('Number of People', fontsize = 20)
plt.show()
By displaying all games on the same scale, it can be seen that a game's player trends appear to be affected by the trends of the other top five games. For example, there is a noticeable downward shift in player count for Dota 2 as PUBG became more popular, but there is a resurgence in player count for Dota 2 after interest in PUBG begins to decline. By being aware of such correlations, game developers can better tune into the demands of their players by aiming to release more favorable updates or downloadable content to increase interest in the game.
From the data in these datasets, broader conclusions can be drawn. Additionally, outside events can explain particular dips (such as during server outages) and peaks (such as when an anticipated update is released). Because PLAYERUNKNOWN’S BATTLEGROUNDS (PUBG), Defense of the Ancients 2 (Dota 2), Counter Strike: Global Offensive (CS:GO), Grand Theft Auto V (GTA V), and Destiny 2 are consistently the top continuously-played games on Steam, their local maximums and minimums over time are particularly telling of outside forces.
Further inferences can be drawn about these outside forces; additionally, noticeable trends in the data can provide insight into future plans for the game development process. Any observed negative trend for a generally highly played game can bolster support for additional features or content updates.
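A relationship like the one between Dota 2 and PUBG can be quantified with a correlation coefficient. The sketch below computes the Pearson correlation on made-up values in which one game's count falls as the other's rises. The column names mirror those in the combined file, but the numbers are illustrative only, and a correlation by itself does not show that one game's popularity causes the other's decline.

```python
import pandas as pd

# Made-up counts where one game dips as the other peaks (illustration only).
demo = pd.DataFrame({
    'DateTime': pd.to_datetime([
        '2017-01-01', '2017-04-01', '2017-07-01', '2017-10-01', '2018-01-01',
    ]),
    'Dota 2': [900, 850, 800, 820, 880],
    "PLAYERUNKNOWN'S BATTLEGROUNDS": [100, 300, 500, 450, 200],
})

# Pairwise Pearson correlation between the player-count columns.
corr = demo.drop(columns='DateTime').corr()
r = corr.loc['Dota 2', "PLAYERUNKNOWN'S BATTLEGROUNDS"]
```

A value of `r` near -1 would indicate that the two counts tend to move in opposite directions over the period examined.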