Teen Market Segmentation Using K-means Clustering¶

Interacting with friends on a social networking service (SNS) has become a rite of passage for teenagers around the world. The many millions of teenage consumers using such sites have attracted the attention of marketers struggling to find an edge in an increasingly competitive market. One way to gain this edge is to identify segments of teenagers who share similar tastes, so that clients can avoid targeting advertisements to teens with no interest in the product being sold. For instance, sporting apparel is likely to be a difficult sell to teens with no interest in sports.

Dataset Information¶

The dataset represents a random sample of 30,000 U.S. high school students who had profiles on a well-known SNS in 2006. To protect the users’ anonymity, the SNS will remain unnamed. The data was sampled evenly across four high school graduation years (2006 through 2009) representing the senior, junior, sophomore, and freshman classes at the time of data collection The dataset contatins 40 variables like: gender, age, friends, basketball, football, soccer, softball, volleyball,swimming, cute, sexy, kissed, sports, rock, god, church, bible, hair, mall, clothes, hollister, drugs etc whcih shows their interests. The final dataset indicates, for each person, how many times each word appeared in the person’s SNS profile

Load Libraries¶

# Importing Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Load Data¶

pd.set_option('display.max_columns',None)
data = pd.read_csv("C:/Users/user/Projects/Datasets/snsdata.csv")
data.head()

Summary Statistics¶

Summary Statistics of Numerical Variables¶

data.describe()

Summary Statistics of Categorical Variables¶

data.describe(include='object')

Treating Missing Values¶

data.isnull().sum()

gradyear           0
gender          2724
age             5086
friends            0
basketball         0
football           0
soccer             0
softball           0
volleyball         0
swimming           0
cheerleading       0
baseball           0
tennis             0
sports             0
cute               0
sex                0
sexy               0
hot                0
kissed             0
dance              0
band               0
marching           0
music              0
rock               0
god                0
church             0
jesus              0
bible              0
hair               0
dress              0
blonde             0
mall               0
shopping           0
clothes            0
hollister          0
abercrombie        0
die                0
death              0
drunk              0
drugs              0
dtype: int64

A total of 5,086 records have missing ages. Also concerning is the fact that the minimum and maximum values seem to be unreasonable; it is unlikely that a 3 year old or a 106 year old is attending high school.

Let's have a look at the number of male and female candidates in our dataset

data['gender'].value_counts()

F    22054
M     5222
Name: gender, dtype: int64

Let's have a look at the number of male, female and msiing values

data['gender'].value_counts(dropna = False)

F      22054
M       5222
NaN     2724
Name: gender, dtype: int64

There are 22054 female, 5222 male teen students and 2724 missing values

Now we are going to fill all the null values in gender column with “No Gender”

data['gender'].fillna('not disclosed', inplace = True)

data['gender'].isnull().sum()

2724

Also, the age cloumn has 5086 missing values. One way to deal with these missing values would be to fill the missing values with the average age of each graduation year

data.groupby('gradyear')['age'].mean()

gradyear
2006    19.137241
2007    18.391459
2008    17.523867
2009    16.876025
Name: age, dtype: float64

From the above summary we can observe that the mean age differs by roughly one year per change in graduation year. This is not at all surprising, but a helpful finding for confirming our data is reasonable

We now fill the missing values for each graduation year with the mean that we got as above

data['age'] = data.groupby('gradyear').transform(lambda x : x.fillna(x.mean()))

data['age'].isnull().sum()

0

We don't have any missing values in the 'age' column

data.isnull().sum()

gradyear        0
gender          0
age             0
friends         0
basketball      0
football        0
soccer          0
softball        0
volleyball      0
swimming        0
cheerleading    0
baseball        0
tennis          0
sports          0
cute            0
sex             0
sexy            0
hot             0
kissed          0
dance           0
band            0
marching        0
music           0
rock            0
god             0
church          0
jesus           0
bible           0
hair            0
dress           0
blonde          0
mall            0
shopping        0
clothes         0
hollister       0
abercrombie     0
die             0
death           0
drunk           0
drugs           0
dtype: int64

From the above summary we can see that there are no missing values in the dataset

Treating Outliers¶

The original age range contains value from 3 - 106, which is unrealistic because student at age of 3 or 106 would not attend high school. A reasonable age range for people attending high school will be the age range between 13 to 21. The rest should be treated as outliers keeping the age of student going to high school in mind. Let's detect the outliers using a box plot below

sns.boxplot(data['age'])

<matplotlib.axes._subplots.AxesSubplot at 0xa187630>

q1 = data['age'].quantile(0.25)
q3 = data['age'].quantile(0.75)
iqr = q3-q1

print(iqr)

1.887459224069687

df = data[(data['age'] > (q1 - 1.5*iqr)) & (data['age'] < (q3 + 1.5*iqr))]

df['age'].describe()

count    29633.000000
mean        17.377469
std          1.147764
min         13.719000
25%         16.501000
50%         17.426000
75%         18.387000
max         21.158000
Name: age, dtype: float64

From the above summary we can observe that after treating the outliers the mininmum age is 13.719000 and the maximum age is 21.158000

df.shape

(29633, 40)

sns.boxplot(df['age'])

<matplotlib.axes._subplots.AxesSubplot at 0x9b51ba8>

From the above boxplot we observe that there are no outliers in the age column

Data Preprocessing¶

A common practice employed prior to any analysis using distance calculations is to normalize or z-score standardize the features so that each utilizes the same range. By doing so, you can avoid a problem in which some features come to dominate solely because they have a larger range of values than the others.
The process of z-score standardization rescales features so that they have a mean of zero and a standard deviation of one. This transformation changes the interpretation of the data in a way that may be useful here. Specifically, if someone mentions Swimming three times on their profile, without additional information, we have no idea whether this implies they like Swimming more or less than their peers. On the other hand, if the z-score is three, we know that that they mentioned Swimming many more times than the average teenager.

names = df.columns[5:40]
scaled_feature = data.copy()
names

Index(['football', 'soccer', 'softball', 'volleyball', 'swimming',
       'cheerleading', 'baseball', 'tennis', 'sports', 'cute', 'sex', 'sexy',
       'hot', 'kissed', 'dance', 'band', 'marching', 'music', 'rock', 'god',
       'church', 'jesus', 'bible', 'hair', 'dress', 'blonde', 'mall',
       'shopping', 'clothes', 'hollister', 'abercrombie', 'die', 'death',
       'drunk', 'drugs'],
      dtype='object')

features = scaled_feature[names]

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(features.values)

C:\Users\user\Anaconda3\lib\site-packages\sklearn\utils\validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)

features = scaler.transform(features.values)

C:\Users\user\Anaconda3\lib\site-packages\sklearn\utils\validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)

scaled_feature[names] = features
scaled_feature.head()

Convert object variable to numeric¶

def gender_to_numeric(x):
    if x=='M':
        return 1
    if x=='F':
        return 2
    if x=='not disclosed':
        return 3

scaled_feature['gender'] = scaled_feature['gender'].apply(gender_to_numeric)
scaled_feature['gender'].head()

0    1
1    2
2    1
3    2
4    3
Name: gender, dtype: int64

Checking the transformed values¶

scaled_feature.head()

Building the K-means model¶

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, random_state=0, n_jobs=-1)

model = kmeans.fit(scaled_feature)

Elbow Method¶

# Creating a funtion with KMeans to plot "The Elbow Curve"
wcss = []
for i in range(1,20):
    kmeans = KMeans(n_clusters=i,init='k-means++',max_iter=300,n_init=10,random_state=0)
    kmeans.fit(scaled_feature)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,20),wcss)
plt.title('The Elbow Curve')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') ##WCSS stands for total within-cluster sum of square
plt.show()

The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters. Our Elbow point is around cluster size of 5. We will use k=5 to further interpret our clustering result

Fit K-Means clustering for k=5¶

kmeans = KMeans(n_clusters=5)
kmeans.fit(scaled_feature)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

As a result of clustering, we have the clustering label. Let's put these labels back into the original numeric data frame.

len(kmeans.labels_)

30000

data['cluster'] = kmeans.labels_

data.head()

Interpreting Clustering Results¶

Let's see cluster sizes first

plt.figure(figsize=(12,7))
axis = sns.barplot(x=np.arange(0,5,1),y=data.groupby(['cluster']).count()['age'].values)
x=axis.set_xlabel("Cluster Number")
x=axis.set_ylabel("Number of Students")

From the above plot we can see that cluster 0 is the largest and cluster 1 has fewest teen students

Let' see the number of students belonging to each cluster

size_array = list(data.groupby(['cluster']).count()['age'].values)
size_array

[14536, 166, 4784, 9176, 1338]

let's check the cluster statistics

data.groupby(['cluster']).mean()[['basketball', 'football','soccer', 'softball','volleyball','swimming','cheerleading','baseball','tennis','sports','cute','sex','sexy','hot','kissed','dance','band','marching','music','rock','god','church','jesus','bible','hair','dress','blonde','mall','shopping','clothes','hollister','abercrombie','die', 'death','drunk','drugs']]

The cluster center values shows each of the cluster centroids of the coordinates. The row referes to the five clusters,the numbers across each row indicates the cluster’s average value for the interest listed at the top of the column. Positive values are above the overall mean level.

	gradyear	age	friends	basketball	football	soccer	softball	volleyball	swimming	cheerleading	baseball	tennis	sports	cute	sex	sexy	hot	kissed	dance	band	marching	music	rock	god	church	jesus	bible	hair	dress	blonde	mall	shopping	clothes	hollister	abercrombie	die	death	drunk	drugs
count	30000.000000	24914.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.00000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.00000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000
mean	2007.500000	17.993950	30.179467	0.267333	0.252300	0.222767	0.161200	0.143133	0.13440	0.106633	0.104933	0.087333	0.139967	0.322867	0.209400	0.141200	0.126600	0.103200	0.425167	0.299600	0.040600	0.737833	0.243333	0.465300	0.248167	0.112067	0.021333	0.422567	0.110967	0.098933	0.257367	0.353000	0.14850	0.069867	0.051167	0.184100	0.114233	0.087967	0.060433
std	1.118053	7.858054	36.530877	0.804708	0.705357	0.917226	0.739707	0.639943	0.51699	0.514333	0.521726	0.516961	0.471080	0.802441	1.123504	0.528209	0.479145	0.509338	1.162574	1.118786	0.287091	1.252366	0.720375	1.343226	0.834028	0.581709	0.204645	1.097958	0.449436	1.942319	0.695758	0.724391	0.47264	0.346779	0.279555	0.624516	0.436796	0.399125	0.345522
min	2006.000000	3.086000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	2006.750000	16.312000	3.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	2007.500000	17.287000	20.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	2008.250000	18.259000	44.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
max	2009.000000	106.927000	830.000000	24.000000	15.000000	27.000000	17.000000	14.000000	31.00000	9.000000	16.000000	15.000000	12.000000	18.000000	114.000000	18.000000	10.000000	26.000000	30.000000	66.000000	11.000000	64.000000	21.000000	79.000000	44.000000	30.000000	11.000000	37.000000	9.000000	327.000000	12.000000	11.000000	8.00000	9.000000	8.000000	22.000000	14.000000	8.000000	16.000000

	gradyear	gender	age	friends	football	soccer	softball	volleyball	swimming	cheerleading	baseball	tennis	sports	cute	sex	sexy	hot	kissed	dance	band	marching	music	rock	god	church	jesus	bible	hair	dress	blonde	mall	shopping	clothes	hollister	abercrombie	die	death	drunk	drugs
0	2006	M	18.982	7	-0.357697	-0.242874	-0.217928	-0.22367	-0.259971	-0.207327	-0.201131	-0.168939	-0.297123	-0.402362	-0.186384	-0.267323	-0.264225	-0.202619	0.494457	-0.267795	-0.141421	-0.589161	-0.337793	-0.346411	-0.297557	-0.192654	-0.104247	-0.384873	-0.246906	-0.050937	-0.369915	-0.487314	-0.314198	-0.201476	-0.183032	-0.294793	-0.261530	-0.220403	-0.174908
1	2006	F	18.801	0	1.060049	-0.242874	-0.217928	-0.22367	-0.259971	-0.207327	-0.201131	-0.168939	-0.297123	0.843856	-0.186384	-0.267323	-0.264225	-0.202619	-0.365718	-0.267795	-0.141421	1.007842	2.438587	0.398078	-0.297557	-0.192654	-0.104247	5.079910	8.653277	-0.050937	1.067392	-0.487314	-0.314198	-0.201476	-0.183032	-0.294793	-0.261530	-0.220403	-0.174908
2	2006	M	18.335	69	1.060049	-0.242874	-0.217928	-0.22367	-0.259971	-0.207327	-0.201131	-0.168939	-0.297123	-0.402362	-0.186384	-0.267323	-0.264225	-0.202619	-0.365718	1.519887	-0.141421	0.209341	-0.337793	-0.346411	-0.297557	-0.192654	-0.104247	-0.384873	-0.246906	-0.050937	-0.369915	-0.487314	-0.314198	-0.201476	-0.183032	-0.294793	2.027908	-0.220403	-0.174908
3	2006	F	18.875	0	-0.357697	-0.242874	-0.217928	-0.22367	-0.259971	-0.207327	-0.201131	-0.168939	-0.297123	0.843856	-0.186384	-0.267323	-0.264225	-0.202619	-0.365718	-0.267795	-0.141421	-0.589161	1.050397	-0.346411	-0.297557	-0.192654	-0.104247	-0.384873	-0.246906	-0.050937	-0.369915	-0.487314	-0.314198	-0.201476	-0.183032	-0.294793	-0.261530	-0.220403	-0.174908
4	2006	not disclosed	18.995	10	-0.357697	-0.242874	-0.217928	-0.22367	-0.259971	-0.207327	-0.201131	-0.168939	-0.297123	-0.402362	0.703703	-0.267323	-0.264225	9.614211	0.494457	0.626046	-0.141421	1.806344	-0.337793	0.398078	-0.297557	-0.192654	-0.104247	0.525925	-0.246906	-0.050937	-0.369915	2.273673	-0.314198	-0.201476	-0.183032	-0.294793	-0.261530	2.285122	2.719316

	gradyear	gender	age	friends	football	soccer	softball	volleyball	swimming	cheerleading	baseball	tennis	sports	cute	sex	sexy	hot	kissed	dance	band	marching	music	rock	god	church	jesus	bible	hair	dress	blonde	mall	shopping	clothes	hollister	abercrombie	die	death	drunk	drugs
0	2006	1	18.982	7	-0.357697	-0.242874	-0.217928	-0.22367	-0.259971	-0.207327	-0.201131	-0.168939	-0.297123	-0.402362	-0.186384	-0.267323	-0.264225	-0.202619	0.494457	-0.267795	-0.141421	-0.589161	-0.337793	-0.346411	-0.297557	-0.192654	-0.104247	-0.384873	-0.246906	-0.050937	-0.369915	-0.487314	-0.314198	-0.201476	-0.183032	-0.294793	-0.261530	-0.220403	-0.174908
1	2006	2	18.801	0	1.060049	-0.242874	-0.217928	-0.22367	-0.259971	-0.207327	-0.201131	-0.168939	-0.297123	0.843856	-0.186384	-0.267323	-0.264225	-0.202619	-0.365718	-0.267795	-0.141421	1.007842	2.438587	0.398078	-0.297557	-0.192654	-0.104247	5.079910	8.653277	-0.050937	1.067392	-0.487314	-0.314198	-0.201476	-0.183032	-0.294793	-0.261530	-0.220403	-0.174908
2	2006	1	18.335	69	1.060049	-0.242874	-0.217928	-0.22367	-0.259971	-0.207327	-0.201131	-0.168939	-0.297123	-0.402362	-0.186384	-0.267323	-0.264225	-0.202619	-0.365718	1.519887	-0.141421	0.209341	-0.337793	-0.346411	-0.297557	-0.192654	-0.104247	-0.384873	-0.246906	-0.050937	-0.369915	-0.487314	-0.314198	-0.201476	-0.183032	-0.294793	2.027908	-0.220403	-0.174908
3	2006	2	18.875	0	-0.357697	-0.242874	-0.217928	-0.22367	-0.259971	-0.207327	-0.201131	-0.168939	-0.297123	0.843856	-0.186384	-0.267323	-0.264225	-0.202619	-0.365718	-0.267795	-0.141421	-0.589161	1.050397	-0.346411	-0.297557	-0.192654	-0.104247	-0.384873	-0.246906	-0.050937	-0.369915	-0.487314	-0.314198	-0.201476	-0.183032	-0.294793	-0.261530	-0.220403	-0.174908
4	2006	3	18.995	10	-0.357697	-0.242874	-0.217928	-0.22367	-0.259971	-0.207327	-0.201131	-0.168939	-0.297123	-0.402362	0.703703	-0.267323	-0.264225	9.614211	0.494457	0.626046	-0.141421	1.806344	-0.337793	0.398078	-0.297557	-0.192654	-0.104247	0.525925	-0.246906	-0.050937	-0.369915	2.273673	-0.314198	-0.201476	-0.183032	-0.294793	-0.261530	2.285122	2.719316

	basketball	football	soccer	softball	volleyball	swimming	cheerleading	baseball	tennis	sports	cute	sex	sexy	hot	kissed	dance	band	marching	music	rock	god	church	jesus	bible	hair	dress	blonde	mall	shopping	clothes	hollister	abercrombie	die	death	drunk	drugs
cluster
0	0.223308	0.229018	0.191387	0.121423	0.109177	0.115094	0.085718	0.091566	0.080627	0.127683	0.272565	0.206728	0.127408	0.108214	0.096313	0.362548	0.269813	0.035498	0.680930	0.221863	0.409741	0.200812	0.091428	0.018368	0.395707	0.095487	0.080696	0.218148	0.296574	0.135182	0.054554	0.039970	0.177353	0.101679	0.084067	0.064461
1	0.313253	0.253012	0.283133	0.210843	0.228916	0.216867	0.180723	0.138554	0.090361	0.126506	0.457831	0.222892	0.186747	0.132530	0.168675	0.463855	0.361446	0.060241	0.692771	0.192771	0.632530	0.240964	0.090361	0.006024	0.674699	0.174699	0.138554	0.325301	0.596386	0.198795	0.126506	0.150602	0.216867	0.168675	0.126506	0.054217
2	0.327968	0.283027	0.275920	0.243311	0.182901	0.156982	0.137542	0.127090	0.094064	0.151965	0.392768	0.183110	0.164089	0.149875	0.115385	0.504599	0.316054	0.045778	0.783445	0.277383	0.544732	0.321697	0.145694	0.023829	0.445443	0.131271	0.103261	0.315635	0.441054	0.168060	0.095945	0.075460	0.176003	0.124373	0.086747	0.059156
3	0.287816	0.266674	0.241173	0.166412	0.166957	0.149847	0.108217	0.109525	0.093723	0.153226	0.340562	0.224935	0.143091	0.131866	0.105166	0.460985	0.339255	0.045554	0.797188	0.258391	0.481909	0.262969	0.120205	0.023540	0.435266	0.117371	0.119551	0.263840	0.373474	0.155405	0.070619	0.048823	0.195292	0.125000	0.089799	0.056670
4	0.382661	0.296712	0.239910	0.257848	0.195815	0.147235	0.203288	0.135277	0.091928	0.141256	0.481315	0.224215	0.190583	0.206278	0.112855	0.571001	0.284753	0.041106	0.791480	0.257848	0.650224	0.399103	0.162930	0.031390	0.514200	0.154709	0.135277	0.422272	0.480568	0.169656	0.130792	0.089686	0.205531	0.133782	0.117339	0.047833