A certain casino has contracted our team to determine if there are any patterns among their tables and patron attendance. They are not asking for predictions, just for a description of what is already there.
First let's begin by loading our libraries and Data. Then let's take a quick look at the data we've loaded and some descriptive statistics for the data we've loaded.
# Load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load data
df = pd.read_csv(filename)
#Print descriptive statistics for numerical values
df.describe()
As we can see from above our tables open and tables occupied range from 0 to 23/24. However the mean is closer to 2 and all have medians at 1. This shows that in general very few tables are open but there have been instances with up to 25 tables opened. The Headcount column ranges from 0 to 109 with a mean of 7 and a median of 3. The hours range from 0 to 23 representing 24 hours in a day. Day of week ranges from 1 to 7 representing everyday in a week, 1 being Sunday. Day Number ranges from 1 to 366 representing the 366 days in a year covered by this dataset with 1 representing 8/1/2011.
Now let's look at the shape and data types of our data, as well as our top 5 and bottom 5 rows. Let's also check for any NaN values.
#Print shape and data types
print(df.shape)
print(df.dtypes)
#Print top 5 and bottom 5 rows
print(df.head())
print(df.tail())
#Print the counts of NaN values
print(df.isnull().sum())
Now that we've loaded our data and taken a quick look at what we've loaded. We can see that we have no missing values, and our columns are mostly the correct type. However, our DateFormat column is an object and should be changed to a date column.
df.loc[:, 'DateFormat'] = pd.to_datetime(df.loc[:, 'DateFormat'])
Now let's create a Month and Year column from our DateFormat column. Let's also create a copy of our data frame and name it df2. We will set our dateformat column as our index for this data frame so we can use it for creating time plots.
#Create month and year columns
df['Month'] = pd.DatetimeIndex(df['DateFormat']).month
df['Year'] = pd.DatetimeIndex(df['DateFormat']).year
#create Df2 as copy and set date as index for time plots
df2 = df.copy(deep=True)
df2.loc[:, 'DateFormat'] = pd.to_datetime(df.loc[:, 'DateFormat'])
df2.set_index('DateFormat', inplace = True)
Now let's go ahead and look at the Unique values or the number of unique values for each of our columns.
print('GameCode Number Unique: ', df['GameCode'].nunique(), '\n') #Print number of uniques
print(df['GameCode'].unique()) #Print unique values
print('DateFormat Number Unique: ', df['DateFormat'].nunique()) #Print number of uniques
print('Hour Number Unique: ', df['Hour'].nunique(), '\n') #Print number of uniques
print(df['Hour'].unique()) #Print unique values
print('TablesOcc Number Unique: ', df['TablesOcc'].nunique(), '\n') #Print number of uniques
print(df['TablesOcc'].unique()) #Print unique values
print('TablesOpen Number Unique: ', df['TablesOpen'].nunique(), '\n') #Print number of uniques
print(df['TablesOpen'].unique()) #Print unique values
print('TablesClosed Number Unique: ', df['TablesClosed'].nunique(), '\n') #Print number of uniques
print(df['TablesClosed'].unique()) #Print unique values
print('HeadCount Number Unique: ', df['HeadCount'].nunique(), '\n') #Print number of uniques
print(df['HeadCount'].unique()) #Print unique values
print('DayofWeek: ', df['DayOfWeek'].unique()) #Print unique values
print('DayNumber Number Unique: ', df['DayNumber'].nunique()) #Print number of uniques
print('Month: ', df['Month'].unique()) #Print unique values
print('Year: ', df['Year'].unique()) #Print unique values
Now let's bin some of our variables to use later for plotting. Let's also print our bin ranges so we can know what our bins represent.
We will bin the Hours column into 4 bins. We will bin the Tables Occupied column into 4 bins. We will bin the Headcount column into 20 bins.
HourBin = pd.cut(df.loc[:, 'Hour'], bins = 4, labels = range(4), retbins = True)
df.loc[:, 'Hour_bin'] = HourBin[0]
print(HourBin[1])
The Hours are binned into 4 categories. This bins them in 6 hour bins. 0(LateNight) from midnight to 5am, 1(Morning) from 6am to 11am, 2(Afternoon) from 12(noon) to 17(5pm) and finally 3(night) from 18(6pm) to 23(11pm).
TablesOccBin = pd.cut(df.loc[:, 'TablesOcc'], bins = 4, labels = range(4), retbins = True)
df.loc[:, 'TablesOcc_bin'] = TablesOccBin[0]
print(TablesOccBin[1])
The Tables are binned into 4 categories. This bins them in counts of 6 with the first bin having 7. 0-6 tables is Category 0, 7-12 tables is Category 1, 13-18 tables is Category 2, 19-24 tables is Category 3.
HeadCountBin = pd.cut(df.loc[:, 'HeadCount'], bins = 20, labels = range(20), retbins = True)
df.loc[:, 'HeadCount_bin'] = HeadCountBin[0]
print(HeadCountBin[1])
The Headcounts are binned into 20 categories which causes them to be in bins of 5 or 6 headcounts. The lowest range is from 0 to 5 and the highest range is from 104-109.
Let's go ahead and set our binned columns as ints.
#Function to set column as int
def setInt(df, i):
df.loc[:,i] = df.loc[:,i].astype('int64')
#Sets the column as Int type
setInt(df,'Hour_bin')
setInt(df,'TablesOcc_bin')
setInt(df,'HeadCount_bin')
Let's take a look at our data frame now.
#Print top 5 rows
print(df.head())
#print data types
print(df.dtypes)
#Print column names
print(df.columns)
We are ready to do some plotting and to find some relationships.
Let's start by looking at the histogram of our headcounts. We will use 10 bins. Since we have 110 values from 0 to 109 our bins each have 11 values. Our first from 0 to 10.
%matplotlib inline
df.loc[:,'HeadCount'].plot.hist(bins = 10)
As we can see our 0 to 10 bin is the largest bin and our bins continuously decrease. Our headcount column's distribution is that of an exponential decay.
Now let's take a look at the time plot of our headcount.
#Timeplot of HeadCount
ax = plt.figure(figsize=(12, 6)).gca()
df2.loc[:,'HeadCount'].plot(ax=ax)
ax.set_xlabel('Date')
ax.set_ylabel('Headcount')
ax.set_title('Time series of Headcounts')
We can see that our headcounts seem to follow a cyclical cycle with peaks around 109 and troughs around 55. There seems to be around 4 peaks a month. This may coincide with the weekends and the troughs being the middle of the week.
Now let's take a look at some attributes in comparassion to headcount.
Let's start by looking at Headcount in relation to the Game Code.
#Create scatter plot with transparency of .3 for each point.
sns.lmplot(x = 'GameCode', y = 'HeadCount',
data = df,
scatter_kws={'alpha':0.3
},
fit_reg = False)
plt.xticks(rotation= 90)
plt.xlabel('GameCode')
plt.ylabel('Headcount')
plt.title('GameCode vs. Headcounts')
We can see that S6 draws the highest headcounts, followed by CR and then TP. These must be our 3 most popular games.
Let's take a look at the box plot of Gamecode vs Headcount to see if we can gain some further insight on this relationship.
#Boxplot Headcount of Game Code
fig = plt.figure(figsize=(12, 9))
ax = fig.gca()
df.loc[:,['GameCode','HeadCount']].boxplot(by = ['GameCode'] , ax = ax)
plt.xticks(rotation=90)
ax.set_title('Box plot of Headcounts in relation to GameCode')
ax.set_ylabel('Headcount')
ax.set_ylim(-5.0, 120.0)
Looking at the box plot we get much more granularity about this relationship.
S6, Big Six, is by far the most popular game at this casino. It has the highest median, drawing on average 38 visitors. It also draws the maximum visitor count of all games at 109. The maximum visitor count falls at the top of the box plots whiskers. This shows that the high number of visitors is not an outlier.
CR, Craps, is the second most popular game at this casino. It has a median of 19 visitors and draws a maximum visitor count of around 88. However, the maximum visitor count is an outlier. The maximum visitor count within 1.5 times the interquartile range is actually around 68.
TP, Texas Hold'em Poker, is the third most popular game at this casino. It has a median of 17 visitors and draws a maximum visitor count of around 66. However, the maximum visitor count is an outlier. The maximum visitor count within 1.5 times the interquartile range is actually around 60.
All other games draw much smaller counts with most box plots having whiskers that fall below the third quartile of TP and CR. RO, roulette, and FP, four card poker, are the only games to come up around the third quartile of TP and CR.
Let's take a look at the heatmap of Headcount versus gamecode. To get better information from this graph let's use the Headcount_bin instead of Headcount.
game_count = pd.crosstab(df.loc[:, 'HeadCount_bin'], df.loc[:, 'GameCode'])
num_games = game_count.apply(sum, axis = 0)
game_count = game_count.div(num_games, axis = 1)
print(game_count.head())
ax = plt.figure(figsize=(9, 6)).gca()
ax.pcolor(game_count, cmap = 'Blues')
ax.set_xticks(range(game_count.shape[1]))
ax.set_xticklabels(game_count.columns, rotation=90)
ax.set_xlabel('GameCodes')
ax.set_ylabel('Headcount bin')
ax.set_title('Headcounts of people by GameCode')
Looking at the above heatmap we can see that most games draw 0 to 5 visitors due to the darkness of the blues near the bottom Headcount bin. The lighter the color at the bottom bin the more popular the game is. So we can see that although some games are not nearly as popular as our top games that they are still more popular than others. S6 is the clearly the most popular game as it has the lightest color at 0 and it has heat all the way up to around the top headcount bin. Our Top 5 games in order of popularity seem to be S6: Big Six, CR: Craps, TP: Texas Hold'em Poker, RO: Roulette, FP: Four Card Poker. Our least popular games seem to be BA: Baccarat, MQ: Blackjack variant, ND: Blackjack variant, PA: Pai Gow Tiles, PG: Pai Gow, TL: Blackjack variant and WR: Casino War.
Let's keep going by looking at the Scatterplot of Headcount versus the Hour to see if we can spot a relationship between the visitors and the time of day.
#Scatterplot of Hour and Headcount
sns.lmplot(x = 'Hour', y = 'HeadCount',
data = df,
scatter_kws={'alpha':0.3
},
fit_reg = False)
plt.xlabel('Hour')
plt.ylabel('Headcount')
plt.title('Hour vs. Headcounts')
We can see that there seems to be a trend between the headcount at the casino and the hour of the day.
Let's take a look at the box plot of Hours vs Headcount to see if we can gain some further insight on this relationship.
#Boxplot HeadCount by Hour
fig = plt.figure(figsize=(9, 9))
ax = fig.gca()
df.loc[:,['Hour','HeadCount']].boxplot(by = ['Hour'] , ax = ax)
ax.set_title('Box plot of Headcounts in relation to Hour')
ax.set_ylabel('Headcount')
ax.set_ylim(-5.0, 120.0)
Looking at the boxplots we can see that the median visitor counts are around the same from 1pm to 1am. 5am to 9am draws almost no visitors with their medians around 0. However looking at the upper quartiles, whiskers, and the outliers we can see that the highest headcounts can be seen from 9pm-2am, then there is a pretty significant drop each hour from 1am until 6am where the casino's headcount reaches the lows. The headcount stays low until 11am where it begins to rise until 1pm then it stays steady until the crowd reaches their maximum around 9pm. The high hours can have up to 110 patrons while the low hours can only have up to 20 patrons.
Let's take a look at the hexbin plot of HeadCount versus hours for more detail.
ax = plt.figure(figsize=(9, 6)).gca()
df.plot.hexbin(x = 'Hour', y = 'HeadCount', gridsize = 15, ax = ax)
ax.set_title('Hour vs Headcounts')
ax.set_ylabel('Headcount')
ax.set_xlabel('Hour')
With the hexbin plot we can clearly see that between 5am and 11am are the lowest visitor counts and between 9pm to 2am are our highest visitor counts.
Now let's look at the TablesOpen against the Headcount.
#Scatterplot of TablesOpen and HeadCount
sns.lmplot(x = 'TablesOpen', y = 'HeadCount',
data = df,
scatter_kws={'alpha':0.3
},
fit_reg = False)
plt.xlabel('Tables Open')
plt.ylabel('Headcount')
plt.title('Tables Open vs. Headcounts')
As we can see this shows that in general the more tables are open the higher the headcounts although as it can be seen there are times were there are many tables open 11, 18 and there are no visitors. There are also times were only 6 tables were open and the headcount was around 90. We can also see that it is very rare for there to be more than 22 tables open however whenever that happens there are at minimum around 30 visitors.
Let's take a look at Tables Occupied in relation to the Headcount next.
#Scatterplot of TablesOcc and HeadCount
sns.lmplot(x = 'TablesOcc', y = 'HeadCount',
data = df,
scatter_kws={'alpha':0.3
},
fit_reg = False)
plt.xlabel('Tables Occ')
plt.ylabel('Headcount')
plt.title('Tables Occ vs. Headcounts')
As we can see this shows that in general the more tables are occupied the higher the headcounts. There are however times were only 6 tables were occupied and the headcount was around 90. We can also see that it is very rare for there to be more than 22 tables occupied however whenever that happens there are at minimum around 60 visitors. The minimum headcounts seem to show exponential growth in relation to tables occupied. What this graph is showing is very logical as the more tables are occupied the more visitors there are. We can assume on the high headcounts for small amount of table days that there must have been some sort of tournament for a certain game which would then draw many visitors to a smaller number of tables.
Let's take a look at the day of the week in relation to the headcount next.
#Scatterplot of Day of Week and HeadCount
sns.lmplot(x = 'DayOfWeek', y = 'HeadCount',
data = df,
scatter_kws={'alpha':0.3
},
fit_reg = False)
plt.xlabel('DayOfWeek')
plt.ylabel('Headcount')
plt.title('DayOfWeek vs. Headcounts')
It looks like there may be a small increase of visitors around the weekend (Days 6, 7, and 1). However, it is hard to tell from this plot. Let's take a look at the box plot of the day of week and the headcount.
#Boxplot HeadCount by Day of Week
fig = plt.figure(figsize=(9, 9))
ax = fig.gca()
df.loc[:,['DayOfWeek','HeadCount']].boxplot(by = ['DayOfWeek'] , ax = ax)
ax.set_title('Box plot of Headcounts in relation to DayOfWeek')
ax.set_ylabel('Headcount')
ax.set_ylim(-5.0, 120.0)
There are many outliers present but looking at the box plots we can see that Saturday and Sunday are the most popular days, having the highest average around 8 and the highest upper whiskers around 28. Monday and Friday seem to have similiar averages around 6 with Monday having upper whiskers around 20 and friday having upper whiskers around 22. Tuesday-Thursday are all very close with Wednesday attracting the least visitors.
Let's take a look at a hexbin plot to see if we can get a better view of this.
ax = plt.figure(figsize=(9, 6)).gca()
df.plot.hexbin(x = 'DayOfWeek', y = 'HeadCount', gridsize = 15, ax = ax)
ax.set_title('Day Of Week vs Headcounts')
ax.set_ylabel('Headcount')
ax.set_xlabel('Day Of Week')
Looking at the hexbin plot we can see that Saturday and Sunday do draw the highest visitors. Monday, Wednesday and Friday draw smaller crowds than the weekend, but we can see that Tuesday and Thursday actually draw the smallest crowds.
Let's look at Headcount by Month next.
#Scatterplot of Month and HeadCount
sns.lmplot(x = 'Month', y = 'HeadCount',
data = df,
scatter_kws={'alpha':0.3
},
fit_reg = False)
plt.xlabel('Month')
plt.ylabel('Headcount')
plt.title('Month vs. Headcounts')
This scatter plot doesn't provide us with much information.
Let's look at the boxplot to see if that shows us anything.
#Boxplot Headcount by Month
fig = plt.figure(figsize=(12, 9))
ax = fig.gca()
df.loc[:,['Month','HeadCount']].boxplot(by = ['Month'] , ax = ax)
ax.set_title('Box plot of Headcounts in relation to Month')
ax.set_ylabel('Headcount')
ax.set_ylim(-5.0, 120.0)
This doesn't show us too much more information. All months seem to have similiar averages, with upper whiskers around 20. March, June and July have upper whiskers around 22.
Let's look at the Hexbin plot to see if it gives us more information.
ax = plt.figure(figsize=(9, 6)).gca()
df.plot.hexbin(x = 'Month', y = 'HeadCount', gridsize = 11, ax = ax)
ax.set_title('Month vs Headcounts')
ax.set_ylabel('Headcount')
ax.set_xlabel('Month')
Looking at this graph we see that most of the months are pretty similiar; however, it does seem that there may be a slight increase in visitors during the summer months.
From our graphs above we can see the things that affect the headcounts the most are the Game Code, the Hour, and the Day of Week. Let's take a look at a couple more graphs.
#Boxplot of hour versus Tables Open
fig = plt.figure(figsize=(12, 9))
ax = fig.gca()
df.loc[:,['Hour','TablesOpen']].boxplot(by = ['Hour'] , ax = ax)
ax.set_title('Box plot of Tables Open in relation to Hour')
ax.set_ylabel('Tables Open')
ax.set_ylim(-1, 25.0)
Looking at the above box plot we can see that the casino generally has the most tables open from 8pm-1am, an average of 3 and upper whiskers of 12. The second most tables are open from 2am and 1pm-7pm with an average of 2 and upper whiskers of 8. We can see that on average there are 2 tables open, except from 8pm-1am where there are 3 tables open.
Looking at the above graph it seems that the casino is close to opening their tables at the most optimal times in relation to their headcounts. If we remember from our earlier graphs we learned that between 5am and 11am the lowest visitor counts are drawn, while the casino has the least tables open from 6-9am. We also learned that between 9pm to 2am are our highest visitor counts while the casino has the most tables open from 8pm-1am.
#Scatterplot of Tables Open vs Tables Occupied
sns.lmplot(x = 'TablesOpen', y = 'TablesOcc',
data = df,
hue = "Hour_bin",
palette="Set1",
scatter_kws={'alpha':0.3
},
fit_reg = False)
plt.xlabel('Tables Open')
plt.ylabel('Tables Occupied')
plt.title('Tables Open vs. Tables Occupied \n with Hour Bin by color')
The above graph shows us Tables Open Vs Tables Occupied colored in by the Hour Bin column. This graph shows us that in general there is a positive linear relationship between the tables occupied and the tables open. This graph also shows that most hours overlap, however we can get some information about the extremes. We can see that our top right most point, which shows us our most tables open and most tables occupied is a redish purple color This shows us that our top point and the one below it are a mixture of our bin 3 and bin 0 which are our evening and late night/early morning bins. We can also see that in general the dots that fall below the linear distribution between tables open and tables occupied are for hour bins 1 and 2 which are our morning and afternoon time bins. We determined those hours to be our least popular, so it is not surprising that large numbers of open tables would go unoccupied. The casino should be more careful about opening many tables during the morning and afternoon as they have a higher chance of going unoccupied.
#Scatterplot of Tables Open vs Tables Occupied
sns.lmplot(x = 'GameCode', y = 'HeadCount',
data = df,
hue = "Hour_bin",
palette="Set1",
scatter_kws={'alpha':0.3},
fit_reg = False)
plt.xticks(rotation= 90)
plt.xlabel('GameCode')
plt.ylabel('Headcount')
plt.title('GameCode vs. Headcounts \n with Hour bin by color')
The above graph shows us Game Code vs Headcount colored in by the Hour Bin column. This graph has a lot of overlapping however, we can still draw some information from it. The top points for each game code are either red or purple. This confirms what we've learned from our exploration that the highest headcounts at each game can be found during the evening to late night hours. The lowest discernable points for most game codes are blue and green confirming to us that the our lowest headcounts are generally in our morning and afternoon hours.
g = sns.FacetGrid(df,
col="DayOfWeek",
row='GameCode',
hue="TablesOcc_bin",
palette="Set1")
g = g.map(plt.scatter, "Hour", "HeadCount")
The above facet grid helps give us a break down of our headcounts by hour colored in by tables occupied bin. The Row of the facet grids shows us the different games offered at the casino and our columns show the different days of the week.
We can see that in general the headcounts per hour are not too affected by the day of the week. We can however see that it does affect the number of tables occupied with an increase in tables occupied on the weekends (Days 6, 7 and 1). We can see on Thursday's there are no visitors playing Baccarat. We can see that Baccarat and MS are by far the least popular games. We also learn that most games only fall into our first bin for tables occupied which is 0-6 tables occupied. There are only four games that have more than 6 tables occupied at any time. They are DH, S6, SH, TP which may have up to 12 tables occupied. Out of these only S6 has had more than 12 tables occupied, reaching up to 24 tables occupied.
We can see that our most popular hours 9pm-2am have the highest headcounts for our games, and they also have the highest tables occupied for each game. During the weekends it looks as if our peak hours extends from 9pm-2am to 2pm-2am. This can be seen with the higher headcounts and more tables occupied, and is very obvious when looking at the S6 graph.
We have reached the end of our journey. We have learned quite a lot including which games are the most popular (S6, CR and TP), which hours are the most popular (9pm-2am), which days are the most popular (Friday, Saturday, Sunday), when it is best to have the most tables open, and even which are the most popular months (June, and July).