Data Visualization - Complementary Views for Casino Games

Abstract

A certain casino has contracted our team to determine if there are any patterns among their tables and patron attendance. They are not asking for predictions, just for a description of what is already there.

Load and Prepare Data

First let's begin by loading our libraries and Data. Then let's take a quick look at the data we've loaded and some descriptive statistics for the data we've loaded.

In [1]:
# Load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# Load data
df = pd.read_csv(filename)
#Print descriptive statistics for numerical values
df.describe()
Out[2]:
Hour TablesOcc TablesOpen TablesClosed HeadCount DayOfWeek DayNumber
count 175677.000000 175677.000000 175677.000000 175677.000000 175677.000000 175677.000000 175677.000000
mean 11.504431 2.074591 2.555804 1.853430 7.390228 3.992953 183.434303
std 6.922330 3.307518 3.673229 2.993767 12.458613 1.997989 105.702167
min 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000
25% 6.000000 0.000000 1.000000 0.000000 0.000000 2.000000 92.000000
50% 12.000000 1.000000 1.000000 1.000000 3.000000 4.000000 183.000000
75% 18.000000 2.000000 3.000000 3.000000 8.000000 6.000000 275.000000
max 23.000000 24.000000 24.000000 23.000000 109.000000 7.000000 366.000000

As we can see from above our tables open and tables occupied range from 0 to 23/24. However the mean is closer to 2 and all have medians at 1. This shows that in general very few tables are open but there have been instances with up to 25 tables opened. The Headcount column ranges from 0 to 109 with a mean of 7 and a median of 3. The hours range from 0 to 23 representing 24 hours in a day. Day of week ranges from 1 to 7 representing everyday in a week, 1 being Sunday. Day Number ranges from 1 to 366 representing the 366 days in a year covered by this dataset with 1 representing 8/1/2011.

Now let's look at the shape and data types of our data, as well as our top 5 and bottom 5 rows. Let's also check for any NaN values.

In [3]:
#Print shape and data types
print(df.shape)
print(df.dtypes)
(175677, 9)
GameCode        object
DateFormat      object
Hour             int64
TablesOcc        int64
TablesOpen       int64
TablesClosed     int64
HeadCount        int64
DayOfWeek        int64
DayNumber        int64
dtype: object
In [4]:
#Print top 5 and bottom 5 rows
print(df.head())
print(df.tail())
  GameCode DateFormat  Hour  TablesOcc  TablesOpen  TablesClosed  HeadCount  \
0       BA  9/16/2011     6          0           0             1          0   
1       BA  9/16/2011     7          0           0             1          0   
2       BA  9/16/2011     8          0           0             1          0   
3       BA  9/16/2011     9          0           0             1          0   
4       BA  9/16/2011    10          0           1             0          0   

   DayOfWeek  DayNumber  
0          6         47  
1          6         47  
2          6         47  
3          6         47  
4          6         47  
       GameCode  DateFormat  Hour  TablesOcc  TablesOpen  TablesClosed  \
175672       WR  10/31/2011    19          0           0             1   
175673       WR  10/31/2011    20          1           1             0   
175674       WR  10/31/2011    21          0           1             0   
175675       WR  10/31/2011    22          1           1             0   
175676       WR  10/31/2011    23          0           1             0   

        HeadCount  DayOfWeek  DayNumber  
175672          0          2         92  
175673          1          2         92  
175674          0          2         92  
175675          1          2         92  
175676          0          2         92  
In [5]:
#Print the counts of NaN values
print(df.isnull().sum())
GameCode        0
DateFormat      0
Hour            0
TablesOcc       0
TablesOpen      0
TablesClosed    0
HeadCount       0
DayOfWeek       0
DayNumber       0
dtype: int64

Now that we've loaded our data and taken a quick look at what we've loaded. We can see that we have no missing values, and our columns are mostly the correct type. However, our DateFormat column is an object and should be changed to a date column.

In [6]:
df.loc[:, 'DateFormat'] = pd.to_datetime(df.loc[:, 'DateFormat'])

Now let's create a Month and Year column from our DateFormat column. Let's also create a copy of our data frame and name it df2. We will set our dateformat column as our index for this data frame so we can use it for creating time plots.

In [7]:
#Create month and year columns
df['Month'] = pd.DatetimeIndex(df['DateFormat']).month
df['Year'] = pd.DatetimeIndex(df['DateFormat']).year
In [8]:
#create Df2 as copy and set date as index for time plots
df2 = df.copy(deep=True)
df2.loc[:, 'DateFormat'] = pd.to_datetime(df.loc[:, 'DateFormat'])
df2.set_index('DateFormat', inplace = True)

Now let's go ahead and look at the Unique values or the number of unique values for each of our columns.

In [9]:
print('GameCode Number Unique: ', df['GameCode'].nunique(), '\n') #Print number of uniques 
print(df['GameCode'].unique()) #Print unique values
GameCode Number Unique:  22 

['BA' 'C4' 'CR' 'DH' 'FP' 'LI' 'MI' 'MQ' 'MR' 'MS' 'ND' 'PA' 'PG' 'RO' 'RR'
 'S6' 'SH' 'TH' 'TL' 'TP' 'UT' 'WR']
In [10]:
print('DateFormat Number Unique: ', df['DateFormat'].nunique()) #Print number of uniques 
DateFormat Number Unique:  366
In [11]:
print('Hour Number Unique: ', df['Hour'].nunique(), '\n') #Print number of uniques 
print(df['Hour'].unique()) #Print unique values
Hour Number Unique:  24 

[ 6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  0  1  2  3  4  5]
In [12]:
print('TablesOcc Number Unique: ', df['TablesOcc'].nunique(), '\n') #Print number of uniques 
print(df['TablesOcc'].unique()) #Print unique values
TablesOcc Number Unique:  25 

[ 0  1  2  6  4  5  3  7 22 20 15  8  9 14 12 18 17 21 13 16 19 11 10 24 23]
In [13]:
print('TablesOpen Number Unique: ', df['TablesOpen'].nunique(), '\n') #Print number of uniques 
print(df['TablesOpen'].unique()) #Print unique values
TablesOpen Number Unique:  25 

[ 0  1  2  6  4  3  5  7 22 20 15  9  8 17 19 21 16 13 10 12 14 24 11 18 23]
In [14]:
print('TablesClosed Number Unique: ', df['TablesClosed'].nunique(), '\n') #Print number of uniques 
print(df['TablesClosed'].unique()) #Print unique values
TablesClosed Number Unique:  24 

[ 1  0  2  4  3  5  6  7 13 14 18 17 20 16  9 12 21 15 22 10  8 11 19 23]
In [15]:
print('HeadCount Number Unique: ', df['HeadCount'].nunique(), '\n') #Print number of uniques 
print(df['HeadCount'].unique()) #Print unique values
HeadCount Number Unique:  110 

[  0   1  12  10   3   8   5   2   6   4   9   7  11  53  62  44  24  18
  26  39  31  30  43  50  51  55  40  15  19  14  23  22  33  17  21  13
  29  35  28  37  20  16  41  87  70  36  27  42  38  45  47  57  64  48
  32  25  65  77  34  49  60  67  54  72  46  58  74  61  52  56  63  79
  59  66  75  68  81  69  78  73  71  76  84  82  92  83  91  99  95  89
  86  80  85  94  98  97  90 104  96 103 106  93 101 100  88 105 102 108
 109 107]
In [16]:
print('DayofWeek: ', df['DayOfWeek'].unique()) #Print unique values
DayofWeek:  [6 7 1 2 3 4 5]
In [17]:
print('DayNumber Number Unique: ', df['DayNumber'].nunique()) #Print number of uniques 
DayNumber Number Unique:  366
In [18]:
print('Month: ', df['Month'].unique()) #Print unique values
Month:  [ 9  4  8 12  2  1  7  6  3  5 11 10]
In [19]:
print('Year: ', df['Year'].unique()) #Print unique values
Year:  [2011 2012]

Now let's bin some of our variables to use later for plotting. Let's also print our bin ranges so we can know what our bins represent.

We will bin the Hours column into 4 bins. We will bin the Tables Occupied column into 4 bins. We will bin the Headcount column into 20 bins.

In [20]:
HourBin = pd.cut(df.loc[:, 'Hour'], bins = 4, labels = range(4), retbins = True) 
df.loc[:, 'Hour_bin'] = HourBin[0]
print(HourBin[1])
[ -0.023   5.75   11.5    17.25   23.   ]

The Hours are binned into 4 categories. This bins them in 6 hour bins. 0(LateNight) from midnight to 5am, 1(Morning) from 6am to 11am, 2(Afternoon) from 12(noon) to 17(5pm) and finally 3(night) from 18(6pm) to 23(11pm).

In [21]:
TablesOccBin = pd.cut(df.loc[:, 'TablesOcc'], bins = 4, labels = range(4), retbins = True)
df.loc[:, 'TablesOcc_bin'] = TablesOccBin[0]
print(TablesOccBin[1]) 
[ -0.024   6.     12.     18.     24.   ]

The Tables are binned into 4 categories. This bins them in counts of 6 with the first bin having 7. 0-6 tables is Category 0, 7-12 tables is Category 1, 13-18 tables is Category 2, 19-24 tables is Category 3.

In [22]:
HeadCountBin = pd.cut(df.loc[:, 'HeadCount'], bins = 20, labels = range(20), retbins = True)
df.loc[:, 'HeadCount_bin'] = HeadCountBin[0] 
print(HeadCountBin[1]) 
[  -0.109    5.45    10.9     16.35    21.8     27.25    32.7     38.15
   43.6     49.05    54.5     59.95    65.4     70.85    76.3     81.75
   87.2     92.65    98.1    103.55   109.   ]

The Headcounts are binned into 20 categories which causes them to be in bins of 5 or 6 headcounts. The lowest range is from 0 to 5 and the highest range is from 104-109.

Let's go ahead and set our binned columns as ints.

In [23]:
#Function to set column as int
def setInt(df, i):
     df.loc[:,i] = df.loc[:,i].astype('int64')
In [24]:
#Sets the column as Int type
setInt(df,'Hour_bin')
setInt(df,'TablesOcc_bin')
setInt(df,'HeadCount_bin')

Let's take a look at our data frame now.

In [25]:
#Print top 5 rows
print(df.head())
  GameCode DateFormat  Hour  TablesOcc  TablesOpen  TablesClosed  HeadCount  \
0       BA 2011-09-16     6          0           0             1          0   
1       BA 2011-09-16     7          0           0             1          0   
2       BA 2011-09-16     8          0           0             1          0   
3       BA 2011-09-16     9          0           0             1          0   
4       BA 2011-09-16    10          0           1             0          0   

   DayOfWeek  DayNumber  Month  Year  Hour_bin  TablesOcc_bin  HeadCount_bin  
0          6         47      9  2011         1              0              0  
1          6         47      9  2011         1              0              0  
2          6         47      9  2011         1              0              0  
3          6         47      9  2011         1              0              0  
4          6         47      9  2011         1              0              0  
In [26]:
#print data types
print(df.dtypes)
GameCode                 object
DateFormat       datetime64[ns]
Hour                      int64
TablesOcc                 int64
TablesOpen                int64
TablesClosed              int64
HeadCount                 int64
DayOfWeek                 int64
DayNumber                 int64
Month                     int64
Year                      int64
Hour_bin                  int64
TablesOcc_bin             int64
HeadCount_bin             int64
dtype: object
In [27]:
#Print column names
print(df.columns)
Index(['GameCode', 'DateFormat', 'Hour', 'TablesOcc', 'TablesOpen',
       'TablesClosed', 'HeadCount', 'DayOfWeek', 'DayNumber', 'Month', 'Year',
       'Hour_bin', 'TablesOcc_bin', 'HeadCount_bin'],
      dtype='object')

Data Exploration

We are ready to do some plotting and to find some relationships.

Let's start by looking at the histogram of our headcounts. We will use 10 bins. Since we have 110 values from 0 to 109 our bins each have 11 values. Our first from 0 to 10.

In [28]:
%matplotlib inline
df.loc[:,'HeadCount'].plot.hist(bins = 10)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f8bac7c88>

As we can see our 0 to 10 bin is the largest bin and our bins continuously decrease. Our headcount column's distribution is that of an exponential decay.

Now let's take a look at the time plot of our headcount.

In [29]:
#Timeplot of HeadCount
ax = plt.figure(figsize=(12, 6)).gca() 
df2.loc[:,'HeadCount'].plot(ax=ax)
ax.set_xlabel('Date') 
ax.set_ylabel('Headcount') 
ax.set_title('Time series of Headcounts')
Out[29]:
Text(0.5,1,'Time series of Headcounts')

We can see that our headcounts seem to follow a cyclical cycle with peaks around 109 and troughs around 55. There seems to be around 4 peaks a month. This may coincide with the weekends and the troughs being the middle of the week.

Now let's take a look at some attributes in comparassion to headcount.

Let's start by looking at Headcount in relation to the Game Code.

In [30]:
#Create scatter plot with transparency of .3 for each point. 
sns.lmplot(x = 'GameCode', y = 'HeadCount', 
           data = df, 
           scatter_kws={'alpha':0.3 
                       },
           fit_reg = False)

plt.xticks(rotation= 90)
plt.xlabel('GameCode') 
plt.ylabel('Headcount') 
plt.title('GameCode vs. Headcounts') 
Out[30]:
Text(0.5,1,'GameCode vs. Headcounts')

We can see that S6 draws the highest headcounts, followed by CR and then TP. These must be our 3 most popular games.

Let's take a look at the box plot of Gamecode vs Headcount to see if we can gain some further insight on this relationship.

In [31]:
#Boxplot Headcount of Game Code
fig = plt.figure(figsize=(12, 9))
ax = fig.gca()  
df.loc[:,['GameCode','HeadCount']].boxplot(by = ['GameCode'] , ax = ax) 
plt.xticks(rotation=90) 
ax.set_title('Box plot of Headcounts in relation to GameCode') 
ax.set_ylabel('Headcount')
ax.set_ylim(-5.0, 120.0) 
Out[31]:
(-5.0, 120.0)

Looking at the box plot we get much more granularity about this relationship.

S6, Big Six, is by far the most popular game at this casino. It has the highest median, drawing on average 38 visitors. It also draws the maximum visitor count of all games at 109. The maximum visitor count falls at the top of the box plots whiskers. This shows that the high number of visitors is not an outlier.

CR, Craps, is the second most popular game at this casino. It has a median of 19 visitors and draws a maximum visitor count of around 88. However, the maximum visitor count is an outlier. The maximum visitor count within 1.5 times the interquartile range is actually around 68.

TP, Texas Hold'em Poker, is the third most popular game at this casino. It has a median of 17 visitors and draws a maximum visitor count of around 66. However, the maximum visitor count is an outlier. The maximum visitor count within 1.5 times the interquartile range is actually around 60.

All other games draw much smaller counts with most box plots having whiskers that fall below the third quartile of TP and CR. RO, roulette, and FP, four card poker, are the only games to come up around the third quartile of TP and CR.

Let's take a look at the heatmap of Headcount versus gamecode. To get better information from this graph let's use the Headcount_bin instead of Headcount.

In [32]:
game_count = pd.crosstab(df.loc[:, 'HeadCount_bin'], df.loc[:, 'GameCode']) 
num_games = game_count.apply(sum, axis = 0) 
game_count = game_count.div(num_games, axis = 1) 
print(game_count.head())
GameCode             BA        C4        CR        DH        FP        LI  \
HeadCount_bin                                                               
0              0.940678  0.837434  0.172818  0.458761  0.256778  0.413990   
1              0.033898  0.158237  0.126111  0.349168  0.226361  0.255981   
2              0.025424  0.004329  0.137275  0.160287  0.248234  0.232171   
3              0.000000  0.000000  0.110617  0.029620  0.163363  0.091707   
4              0.000000  0.000000  0.128161  0.002165  0.096263  0.006152   

GameCode             MI        MQ        MR        MS    ...           PG  \
HeadCount_bin                                            ...                
0              0.836964  0.995785  0.675439  0.714286    ...     0.980519   
1              0.157010  0.004215  0.263728  0.272109    ...     0.017316   
2              0.006025  0.000000  0.056961  0.013605    ...     0.002165   
3              0.000000  0.000000  0.003759  0.000000    ...     0.000000   
4              0.000000  0.000000  0.000114  0.000000    ...     0.000000   

GameCode             RO        RR        S6        SH        TH       TL  \
HeadCount_bin                                                              
0              0.349966  0.811119  0.159945  0.385737  0.801549  0.98576   
1              0.230576  0.170426  0.064707  0.259626  0.185805  0.01424   
2              0.180337  0.018455  0.059125  0.183641  0.012645  0.00000   
3              0.123832  0.000000  0.046480  0.099339  0.000000  0.00000   
4              0.091023  0.000000  0.046936  0.050809  0.000000  0.00000   

GameCode             TP        UT        WR  
HeadCount_bin                                
0              0.259512  0.787993  0.997835  
1              0.112896  0.201754  0.002165  
2              0.138300  0.010253  0.000000  
3              0.119389  0.000000  0.000000  
4              0.130668  0.000000  0.000000  

[5 rows x 22 columns]
In [33]:
ax = plt.figure(figsize=(9, 6)).gca() 
ax.pcolor(game_count, cmap = 'Blues')
ax.set_xticks(range(game_count.shape[1])) 
ax.set_xticklabels(game_count.columns, rotation=90) 
ax.set_xlabel('GameCodes') 
ax.set_ylabel('Headcount bin')
ax.set_title('Headcounts of people by GameCode') 
Out[33]:
Text(0.5,1,'Headcounts of people by GameCode')

Looking at the above heatmap we can see that most games draw 0 to 5 visitors due to the darkness of the blues near the bottom Headcount bin. The lighter the color at the bottom bin the more popular the game is. So we can see that although some games are not nearly as popular as our top games that they are still more popular than others. S6 is the clearly the most popular game as it has the lightest color at 0 and it has heat all the way up to around the top headcount bin. Our Top 5 games in order of popularity seem to be S6: Big Six, CR: Craps, TP: Texas Hold'em Poker, RO: Roulette, FP: Four Card Poker. Our least popular games seem to be BA: Baccarat, MQ: Blackjack variant, ND: Blackjack variant, PA: Pai Gow Tiles, PG: Pai Gow, TL: Blackjack variant and WR: Casino War.

Let's keep going by looking at the Scatterplot of Headcount versus the Hour to see if we can spot a relationship between the visitors and the time of day.

In [34]:
#Scatterplot of Hour and Headcount
sns.lmplot(x = 'Hour', y = 'HeadCount', 
           data = df, 
           scatter_kws={'alpha':0.3 
                       },
           fit_reg = False)
plt.xlabel('Hour') 
plt.ylabel('Headcount') 
plt.title('Hour vs. Headcounts') 
Out[34]:
Text(0.5,1,'Hour vs. Headcounts')

We can see that there seems to be a trend between the headcount at the casino and the hour of the day.

Let's take a look at the box plot of Hours vs Headcount to see if we can gain some further insight on this relationship.

In [35]:
#Boxplot HeadCount by Hour
fig = plt.figure(figsize=(9, 9)) 
ax = fig.gca()  
df.loc[:,['Hour','HeadCount']].boxplot(by = ['Hour'] , ax = ax)
ax.set_title('Box plot of Headcounts in relation to Hour') 
ax.set_ylabel('Headcount')
ax.set_ylim(-5.0, 120.0) 
Out[35]:
(-5.0, 120.0)

Looking at the boxplots we can see that the median visitor counts are around the same from 1pm to 1am. 5am to 9am draws almost no visitors with their medians around 0. However looking at the upper quartiles, whiskers, and the outliers we can see that the highest headcounts can be seen from 9pm-2am, then there is a pretty significant drop each hour from 1am until 6am where the casino's headcount reaches the lows. The headcount stays low until 11am where it begins to rise until 1pm then it stays steady until the crowd reaches their maximum around 9pm. The high hours can have up to 110 patrons while the low hours can only have up to 20 patrons.

Let's take a look at the hexbin plot of HeadCount versus hours for more detail.

In [36]:
ax = plt.figure(figsize=(9, 6)).gca() 
df.plot.hexbin(x = 'Hour', y = 'HeadCount', gridsize = 15, ax = ax)
ax.set_title('Hour vs Headcounts') 
ax.set_ylabel('Headcount')
ax.set_xlabel('Hour') 
Out[36]:
Text(0.5,0,'Hour')

With the hexbin plot we can clearly see that between 5am and 11am are the lowest visitor counts and between 9pm to 2am are our highest visitor counts.

Now let's look at the TablesOpen against the Headcount.

In [37]:
#Scatterplot of TablesOpen and HeadCount
sns.lmplot(x = 'TablesOpen', y = 'HeadCount', 
           data = df, 
           scatter_kws={'alpha':0.3
                       },
           fit_reg = False)
plt.xlabel('Tables Open')
plt.ylabel('Headcount') 
plt.title('Tables Open vs. Headcounts')
Out[37]:
Text(0.5,1,'Tables Open vs. Headcounts')

As we can see this shows that in general the more tables are open the higher the headcounts although as it can be seen there are times were there are many tables open 11, 18 and there are no visitors. There are also times were only 6 tables were open and the headcount was around 90. We can also see that it is very rare for there to be more than 22 tables open however whenever that happens there are at minimum around 30 visitors.

Let's take a look at Tables Occupied in relation to the Headcount next.

In [38]:
#Scatterplot of TablesOcc and HeadCount
sns.lmplot(x = 'TablesOcc', y = 'HeadCount', 
           data = df, 
           scatter_kws={'alpha':0.3 
                       },
           fit_reg = False)
plt.xlabel('Tables Occ') 
plt.ylabel('Headcount') 
plt.title('Tables Occ vs. Headcounts') 
Out[38]:
Text(0.5,1,'Tables Occ vs. Headcounts')

As we can see this shows that in general the more tables are occupied the higher the headcounts. There are however times were only 6 tables were occupied and the headcount was around 90. We can also see that it is very rare for there to be more than 22 tables occupied however whenever that happens there are at minimum around 60 visitors. The minimum headcounts seem to show exponential growth in relation to tables occupied. What this graph is showing is very logical as the more tables are occupied the more visitors there are. We can assume on the high headcounts for small amount of table days that there must have been some sort of tournament for a certain game which would then draw many visitors to a smaller number of tables.

Let's take a look at the day of the week in relation to the headcount next.

In [39]:
#Scatterplot of Day of Week and HeadCount
sns.lmplot(x = 'DayOfWeek', y = 'HeadCount', 
           data = df, 
           scatter_kws={'alpha':0.3 
                       },
           fit_reg = False)
plt.xlabel('DayOfWeek') 
plt.ylabel('Headcount') 
plt.title('DayOfWeek vs. Headcounts') 
Out[39]:
Text(0.5,1,'DayOfWeek vs. Headcounts')

It looks like there may be a small increase of visitors around the weekend (Days 6, 7, and 1). However, it is hard to tell from this plot. Let's take a look at the box plot of the day of week and the headcount.

In [40]:
#Boxplot HeadCount by Day of Week
fig = plt.figure(figsize=(9, 9)) 
ax = fig.gca() 
df.loc[:,['DayOfWeek','HeadCount']].boxplot(by = ['DayOfWeek'] , ax = ax)
ax.set_title('Box plot of Headcounts in relation to DayOfWeek') 
ax.set_ylabel('Headcount')
ax.set_ylim(-5.0, 120.0) 
Out[40]:
(-5.0, 120.0)

There are many outliers present but looking at the box plots we can see that Saturday and Sunday are the most popular days, having the highest average around 8 and the highest upper whiskers around 28. Monday and Friday seem to have similiar averages around 6 with Monday having upper whiskers around 20 and friday having upper whiskers around 22. Tuesday-Thursday are all very close with Wednesday attracting the least visitors.

Let's take a look at a hexbin plot to see if we can get a better view of this.

In [41]:
ax = plt.figure(figsize=(9, 6)).gca() 
df.plot.hexbin(x = 'DayOfWeek', y = 'HeadCount', gridsize = 15, ax = ax) 
ax.set_title('Day Of Week vs Headcounts') 
ax.set_ylabel('Headcount')
ax.set_xlabel('Day Of Week') 
Out[41]:
Text(0.5,0,'Day Of Week')

Looking at the hexbin plot we can see that Saturday and Sunday do draw the highest visitors. Monday, Wednesday and Friday draw smaller crowds than the weekend, but we can see that Tuesday and Thursday actually draw the smallest crowds.

Let's look at Headcount by Month next.

In [42]:
#Scatterplot of Month and HeadCount
sns.lmplot(x = 'Month', y = 'HeadCount', 
           data = df, 
           scatter_kws={'alpha':0.3 
                       },
           fit_reg = False)
plt.xlabel('Month')
plt.ylabel('Headcount') 
plt.title('Month vs. Headcounts') 
Out[42]:
Text(0.5,1,'Month vs. Headcounts')

This scatter plot doesn't provide us with much information.

Let's look at the boxplot to see if that shows us anything.

In [43]:
#Boxplot Headcount  by Month
fig = plt.figure(figsize=(12, 9))
ax = fig.gca() 
df.loc[:,['Month','HeadCount']].boxplot(by = ['Month'] , ax = ax) 
ax.set_title('Box plot of Headcounts in relation to Month') 
ax.set_ylabel('Headcount')
ax.set_ylim(-5.0, 120.0) 
Out[43]:
(-5.0, 120.0)

This doesn't show us too much more information. All months seem to have similiar averages, with upper whiskers around 20. March, June and July have upper whiskers around 22.

Let's look at the Hexbin plot to see if it gives us more information.

In [44]:
ax = plt.figure(figsize=(9, 6)).gca() 
df.plot.hexbin(x = 'Month', y = 'HeadCount', gridsize = 11, ax = ax) 
ax.set_title('Month vs Headcounts') 
ax.set_ylabel('Headcount')
ax.set_xlabel('Month') 
Out[44]:
Text(0.5,0,'Month')

Looking at this graph we see that most of the months are pretty similiar; however, it does seem that there may be a slight increase in visitors during the summer months.

From our graphs above we can see the things that affect the headcounts the most are the Game Code, the Hour, and the Day of Week. Let's take a look at a couple more graphs.

In [45]:
#Boxplot of hour versus Tables Open
fig = plt.figure(figsize=(12, 9)) 
ax = fig.gca() 
df.loc[:,['Hour','TablesOpen']].boxplot(by = ['Hour'] , ax = ax) 
ax.set_title('Box plot of Tables Open in relation to Hour')
ax.set_ylabel('Tables Open')
ax.set_ylim(-1, 25.0) 
Out[45]:
(-1, 25.0)

Looking at the above box plot we can see that the casino generally has the most tables open from 8pm-1am, an average of 3 and upper whiskers of 12. The second most tables are open from 2am and 1pm-7pm with an average of 2 and upper whiskers of 8. We can see that on average there are 2 tables open, except from 8pm-1am where there are 3 tables open.

Looking at the above graph it seems that the casino is close to opening their tables at the most optimal times in relation to their headcounts. If we remember from our earlier graphs we learned that between 5am and 11am the lowest visitor counts are drawn, while the casino has the least tables open from 6-9am. We also learned that between 9pm to 2am are our highest visitor counts while the casino has the most tables open from 8pm-1am.

In [46]:
#Scatterplot of Tables Open vs Tables Occupied
sns.lmplot(x = 'TablesOpen', y = 'TablesOcc', 
           data = df,
           hue = "Hour_bin", 
           palette="Set1",            
           scatter_kws={'alpha':0.3 
                       },
           fit_reg = False)
plt.xlabel('Tables Open') 
plt.ylabel('Tables Occupied')
plt.title('Tables Open vs. Tables Occupied \n with Hour Bin by color') 
Out[46]:
Text(0.5,1,'Tables Open vs. Tables Occupied \n with Hour Bin by color')

The above graph shows us Tables Open Vs Tables Occupied colored in by the Hour Bin column. This graph shows us that in general there is a positive linear relationship between the tables occupied and the tables open. This graph also shows that most hours overlap, however we can get some information about the extremes. We can see that our top right most point, which shows us our most tables open and most tables occupied is a redish purple color This shows us that our top point and the one below it are a mixture of our bin 3 and bin 0 which are our evening and late night/early morning bins. We can also see that in general the dots that fall below the linear distribution between tables open and tables occupied are for hour bins 1 and 2 which are our morning and afternoon time bins. We determined those hours to be our least popular, so it is not surprising that large numbers of open tables would go unoccupied. The casino should be more careful about opening many tables during the morning and afternoon as they have a higher chance of going unoccupied.

In [47]:
#Scatterplot of Tables Open vs Tables Occupied
sns.lmplot(x = 'GameCode', y = 'HeadCount', 
           data = df, 
           hue = "Hour_bin", 
           palette="Set1",
           scatter_kws={'alpha':0.3}, 
           fit_reg = False)
plt.xticks(rotation= 90)
plt.xlabel('GameCode') 
plt.ylabel('Headcount') 
plt.title('GameCode vs. Headcounts \n with Hour bin by color') 
Out[47]:
Text(0.5,1,'GameCode vs. Headcounts \n with Hour bin by color')

The above graph shows us Game Code vs Headcount colored in by the Hour Bin column. This graph has a lot of overlapping however, we can still draw some information from it. The top points for each game code are either red or purple. This confirms what we've learned from our exploration that the highest headcounts at each game can be found during the evening to late night hours. The lowest discernable points for most game codes are blue and green confirming to us that the our lowest headcounts are generally in our morning and afternoon hours.

In [48]:
g = sns.FacetGrid(df, 
                  col="DayOfWeek", 
                  row='GameCode',  
                  hue="TablesOcc_bin", 
                  palette="Set1") 
g = g.map(plt.scatter, "Hour", "HeadCount") 

The above facet grid helps give us a break down of our headcounts by hour colored in by tables occupied bin. The Row of the facet grids shows us the different games offered at the casino and our columns show the different days of the week.

We can see that in general the headcounts per hour are not too affected by the day of the week. We can however see that it does affect the number of tables occupied with an increase in tables occupied on the weekends (Days 6, 7 and 1). We can see on Thursday's there are no visitors playing Baccarat. We can see that Baccarat and MS are by far the least popular games. We also learn that most games only fall into our first bin for tables occupied which is 0-6 tables occupied. There are only four games that have more than 6 tables occupied at any time. They are DH, S6, SH, TP which may have up to 12 tables occupied. Out of these only S6 has had more than 12 tables occupied, reaching up to 24 tables occupied.

We can see that our most popular hours 9pm-2am have the highest headcounts for our games, and they also have the highest tables occupied for each game. During the weekends it looks as if our peak hours extends from 9pm-2am to 2pm-2am. This can be seen with the higher headcounts and more tables occupied, and is very obvious when looking at the S6 graph.

Conclusion

We have reached the end of our journey. We have learned quite a lot including which games are the most popular (S6, CR and TP), which hours are the most popular (9pm-2am), which days are the most popular (Friday, Saturday, Sunday), when it is best to have the most tables open, and even which are the most popular months (June, and July).