My First Data Project: Exploring Daniel Bourke’s Youtube Channel (Part 2)


Welcome back!

This is part 2 of a multi-part post, where we look into Daniel Bourke’s YouTube channel.

The question we are asking is: how popular are Daniel Bourke’s machine learning related videos compared to his other content?

In the first post, we used the YouTube Data API v3 to pull statistics for every video he has uploaded.

We have this data sitting in a pandas dataframe ready for us to work on.


We are going to work on this data and build some visualisations to answer this question.

I have also included this on GitHub.

Exploring the Data

We have the dataframe, now it’s time to explore the different columns using df.dtypes. Here is the output:

kind                                         object
etag                                         object
id                                           object
snippet.publishedAt                          object
snippet.channelId                            object
snippet.title                                object
snippet.description                          object
snippet.thumbnails.default.url               object
snippet.thumbnails.default.width              int64
snippet.thumbnails.default.height             int64
snippet.thumbnails.medium.url                object
snippet.thumbnails.medium.width               int64
snippet.thumbnails.medium.height              int64
snippet.thumbnails.high.url                  object
snippet.thumbnails.high.width                 int64
snippet.thumbnails.high.height                int64
snippet.thumbnails.standard.url              object
snippet.thumbnails.standard.width           float64
snippet.thumbnails.standard.height          float64
snippet.thumbnails.maxres.url                object
snippet.thumbnails.maxres.width             float64
snippet.thumbnails.maxres.height            float64
snippet.channelTitle                         object
snippet.categoryId                           object
snippet.liveBroadcastContent                 object
snippet.localized.title                      object
snippet.localized.description                object
contentDetails.duration                      object
contentDetails.dimension                     object
contentDetails.definition                    object
contentDetails.caption                       object
contentDetails.licensedContent                 bool
contentDetails.projection                    object
statistics.viewCount                         object
statistics.likeCount                         object
statistics.dislikeCount                      object
statistics.favoriteCount                     object
statistics.commentCount                      object
snippet.tags                                 object
snippet.defaultAudioLanguage                 object
contentDetails.regionRestriction.blocked     object
dtype: object
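
Before we transform anything, here are the imports the rest of the code assumes (a sketch of the setup; the full notebook from part 1 is on GitHub):

import re
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import isodate  # pip install isodate; parses ISO 8601 durations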

The next segment of code selects the relevant columns. We will also convert the data types to something we can work with using astype().

df1 = df[['snippet.title', 'snippet.tags', 'snippet.description', 'contentDetails.duration',
          'statistics.viewCount', 'statistics.likeCount', 'statistics.dislikeCount',
          'statistics.commentCount', 'snippet.publishedAt']]

# The count columns come back from the API as strings, so cast them all to floats in one call
df1 = df1.astype({'statistics.viewCount': 'float',
                  'statistics.likeCount': 'float',
                  'statistics.dislikeCount': 'float',
                  'statistics.commentCount': 'float'})

Furthermore, there is time data that also needs to be converted to a workable format.

df1['snippet.publishedAt_REFORMATED'] = df1['snippet.publishedAt'].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ'))
df1['contentDetails.duration_REFORMATED'] = df1['contentDetails.duration'].apply(isodate.parse_duration)  # turns an ISO 8601 duration into a timedelta Python can work with
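
To see what these conversions produce, here is a quick check on made-up sample values (not from the real data):

isodate.parse_duration('PT11M32S')                               # -> datetime.timedelta(seconds=692)
datetime.strptime('2020-01-15T09:30:00Z', '%Y-%m-%dT%H:%M:%SZ')  # -> datetime.datetime(2020, 1, 15, 9, 30)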

Now we can check on our newly converted columns with df1.dtypes.

snippet.title                                  object
snippet.tags                                   object
contentDetails.duration                        object
statistics.viewCount                          float64
statistics.likeCount                          float64
statistics.dislikeCount                       float64
statistics.commentCount                       float64
snippet.publishedAt                            object
snippet.description                            object
snippet.publishedAt_REFORMATED         datetime64[ns]
contentDetails.duration_REFORMATED    timedelta64[ns]
dtype: object

Let’s explore the distribution of view counts using the matplotlib library. We are also going to use the seaborn library, which makes the graph more appealing.

views = df1['statistics.viewCount']

sns.set(style='darkgrid')
plt.figure(figsize=(15,10))

ax = sns.distplot(views, kde=False)
ax.set(title='Distribution of View Counts', xlabel='View Count', ylabel='Number of Videos')

This graph shows how difficult it is to be a content creator. The large majority of Daniel Bourke’s videos do not make it over the 10,000 view mark.

We can further confirm this.

vidsUnder10000views = (df1['statistics.viewCount'] < 10000).sum()
totalVids = len(listOfVideo_IDs)  # listOfVideo_IDs comes from part 1
percentage = round((vidsUnder10000views * 100) / totalVids, 2)
print('Videos under 10,000 views: {} ({}%)'.format(str(vidsUnder10000views), str(percentage)))

Output:

Videos under 10,000 views: 217 (86.11%)

Still, videos that get close to 10,000 views are by no means a ‘flop’. For the curious, here is the distribution of the videos below the 10,000 view mark.

viewsU10000 = df1['statistics.viewCount'][df1['statistics.viewCount']<10000]
sns.set(style='darkgrid')
plt.figure(figsize=(15,10))

ax = sns.distplot(viewsU10000, kde=False)
ax.set(title='Videos under 10,000 views',xlabel='View Count', ylabel='Number of Videos')

Classifying the Videos

We are getting a little carried away.

Now we need a way to classify the videos into machine learning and non-machine learning videos. Let’s have a quick look at the video tags.

df['snippet.tags'].head()

And here is the output:

0                                                  NaN
1    [self supervised learning machine learning, ma...
2                                                  NaN
3    [ken jee, daniel bourke, what questions get as...
4    [machine learning field guide, machine learnin...
Name: snippet.tags, dtype: object

Most of the tags are in lists. Some are empty or NaN.

We can expand the lists and count the most frequent tags. The way we do this is by using apply(pd.Series), which expands each list of tags out into a row of a dataframe with one column per tag. Next, stack() collapses that wide dataframe back into a single long series with one entry per tag, dropping the NaNs. reset_index(), as its name suggests, resets the index; passing drop=True prevents the previous index from becoming a column. Finally, we use value_counts() to count the number of times each tag occurs.
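
If that chain is hard to picture, here is how it behaves on a tiny made-up series (hypothetical tags, not the real data):

toy = pd.Series([['a', 'b'], np.nan, ['b']])
toy.apply(pd.Series)
#      0    1
# 0    a    b
# 1  NaN  NaN
# 2    b  NaN
toy.apply(pd.Series).stack().reset_index(drop=True).value_counts()
# b    2
# a    1
# dtype: int64

Applying the same chain to the real tags column: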

tags = df['snippet.tags'].apply(pd.Series).stack().reset_index(drop=True).value_counts()

tagsAbove10 = tags[lambda x : x > 10]
tagsAbove10

This outputs:

daniel bourke                                 95
fitness                                       58
machine learning engineer                     56
learning                                      52
life                                          51
vlog                                          51
podcast                                       51
lessons                                       49
machine                                       49
entertainment                                 49
Daniel Bourke                                 48
The Daniel Bourke Show                        47
college                                       47
university                                    47
success                                       47
fail                                          47
The                                           46
Daniel                                        46
public                                        46
Bourke                                        46
drop out                                      46
speaking                                      46
Show                                          46
machine learning                              44
mrdbourke                                     34
artificial intelligence                       29
udacity                                       24
deep learning                                 23
code                                          19
data science                                  17
python                                        16
coursera                                      16
coding                                        16
100 days of code                              16
udacity nanodegree                            16
Udacity Artificial Intelligence Nanodegree    13
AI                                            13
nanodegree                                    13
data scientist                                12
udacity deep learning nanodegree              12
ai                                            11
dtype: int64

Putting this all together, we get a series of tag counts in descending order, which we can graph.

sns.set(style='darkgrid')
plt.figure(figsize=(15,10))

ax = sns.barplot(tagsAbove10.index, tagsAbove10.values)
ax.set(title='Tags',xlabel='Tags', ylabel='Number of mentions')
ticks = ax.set_xticklabels(ax.get_xticklabels(),rotation=80) # if not assigned to a variable, then prints lots of text

By visualising the tags, we can create a list of words that, if present in the title or in the tags, will deem the video a ‘machine learning topic’.

pat = r'.*(machine learning|data science|artificial intelligence|deep learning|machine|python|coding|code|udacity|coursera|data scientist|tensorflow).*$'

df1['isMachineLearning'] = (df1['snippet.tags'].map(lambda x: bool(re.search(pat, str(x).lower())))
                            | df1['snippet.title'].map(lambda x: bool(re.search(pat, str(x).lower()))))

Here we’ve got a really long regular expression string that checks whether any of the above words are present in the video’s tags or title. It uses a combination of a lambda function, which creates a one-line function, and map(), which applies that function to every single cell. re.search() looks for a match, bool() ensures True or False is returned, and lower() is applied so differing upper and lower cases don’t matter. The result goes into a new column in the same dataframe called isMachineLearning.
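
To sanity-check the pattern, we can try it on a couple of made-up titles (hypothetical examples, not real videos from the channel):

sample_titles = ['My Machine Learning Journey', 'What I Eat in a Day']
[bool(re.search(pat, t.lower())) for t in sample_titles]
# [True, False]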

After all this, we can check how many NaNs we have.

NaNs = df1['isMachineLearning'].isna().sum()
Total_MLvids =  df1['isMachineLearning'].sum()
Total_nonMLvids = (df1['isMachineLearning']==False).sum()
Total = Total_MLvids + Total_nonMLvids

print('Total NaN: ', NaNs)
print('Total ML videos: ', Total_MLvids)
print('Total non-ML videos: ', Total_nonMLvids)
print('Videos in total: ', Total)

print('Percentage of ML videos: {}%'.format(str(round((Total_MLvids * 100) / Total, 2))))

Here is the output:

Total NaN:  0
Total ML videos:  174
Total non-ML videos:  78
Videos in total:  252
Percentage of ML videos: 69.05%

A large majority of the videos Daniel Bourke makes are related to machine learning. However, this assumes the method we used to classify the videos works perfectly (it probably doesn’t, and there is probably a better way to do it, so please comment and tell me how you would).

Visualising the Difference

Let’s plot these two groups in a box plot and see if we can build a meaningful visualisation. I’ve also added some lines of code to make the titles and axis labels a bit easier to see.

MLds = df1[df1['isMachineLearning']==True]['statistics.viewCount'] # create dataseries of view counts for ML related videos
nonMLds = df1[df1['isMachineLearning']==False]['statistics.viewCount']

d = {'ML related': MLds.values, 'non-ML related': nonMLds.values} # creating a dictionary to title data
d = pd.DataFrame.from_dict(d, orient='index') # dataseries have different shape or 'sizes', pandas will add NaN to make series same size

data = d.transpose()

plt.figure(figsize=(15,10))
sns.set(style='darkgrid')
ax = sns.boxplot(data=data, orient='h')

ax.set_title("Daniel Bourke's YouTube Comparison", fontsize=30)
ax.set_xlabel('View Count', fontsize=20)  # raw counts for now; we apply the log below
ax.set_ylabel('Video Type', fontsize=20)

plt.show()

Hmmm… it’s really hard to see what is going on here because the data is so compressed at one end. As we found out before, most videos aren’t going to make the 10,000 view mark, so most sit at the lower end while a few videos accumulate a lot more views. A lot more!

This is because view count has an exponential pattern – a ‘viral’ effect. This means we need a way of linearising the data, which we can do by applying a logarithm with np.log(). I’m going to use the same code as above but change two lines: applying the logarithm and changing the axis label. We are also going to add a swarm plot, which represents each video as a dot.

d = {'ML related': np.log(MLds.values), 'non-ML related': np.log(nonMLds.values)} # applying natural log

# more code

ax.set_xlabel('Natural Log of View Count', fontsize=20)

#more code

ax = sns.swarmplot(data=data, orient='h', color='0.4')

plt.show()
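
For reference, here is roughly what the full block looks like with those changes folded in (the same structure as the earlier box plot code):

d = {'ML related': np.log(MLds.values), 'non-ML related': np.log(nonMLds.values)} # applying natural log
d = pd.DataFrame.from_dict(d, orient='index')
data = d.transpose()

plt.figure(figsize=(15,10))
sns.set(style='darkgrid')
ax = sns.boxplot(data=data, orient='h')
ax = sns.swarmplot(data=data, orient='h', color='0.4') # one dot per video

ax.set_title("Daniel Bourke's YouTube Comparison", fontsize=30)
ax.set_xlabel('Natural Log of View Count', fontsize=20)
ax.set_ylabel('Video Type', fontsize=20)

plt.show()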

Much better, don’t you think?

We can see that the machine learning related videos, though they have a bigger spread, have more views on average.

Finally, another way to visualise this difference is with a histogram / distribution plot. The two groups have different numbers of videos, so the next plot normalises the data.

d = {'ML related': np.log(MLds.values), 'non-ML related': np.log(nonMLds.values)}
d = pd.DataFrame.from_dict(d, orient='index')

data = d.transpose()

plt.figure(figsize=(15,10))
sns.set(style='darkgrid')

ax = sns.distplot(data['ML related'])
ax = sns.distplot(data['non-ML related'])

ax.set_title("Daniel Bourke's YouTube Comparison", fontsize=30)
ax.set_xlabel('Natural Log of View Count', fontsize=20)
ax.set_ylabel('Proportion', fontsize=20)

plt.legend(data)

plt.show()

An interesting fact is that Daniel Bourke’s most popular video is not a machine learning one.

df1.sort_values(by='statistics.viewCount', ascending=False).head(2)
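
If we only want the titles, a convenience variant of the same sort (selecting a single column before head()):

df1.sort_values(by='statistics.viewCount', ascending=False)['snippet.title'].head(2)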

The final graph I want to use is a time-series regression to show how videos from the two groups fare over time.

We are going to plot the view counts over their publish dates, then use sklearn to produce a linear regression, or best-fit line.

from sklearn.linear_model import LinearRegression

def perform_reg(X, Y):
    model = LinearRegression()
    X = X.values.reshape(-1, 1)  # sklearn expects a 2D feature array
    model.fit(X, Y)
    return model.predict(X)  # fitted values along the best-fit line

df1['statistics.viewCount_LOG'] = df1['statistics.viewCount'].apply(np.log)
MLdf = df1[df1['isMachineLearning']==True].copy()  # .copy() avoids pandas' SettingWithCopyWarning when we add columns below
nonMLdf = df1[df1['isMachineLearning']==False].copy()

d = {'ML related': MLdf, 'non-ML related': nonMLdf}  # used later to label the legend

MLdf['statistics.viewCount_REG'] = perform_reg(MLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), MLdf['statistics.viewCount_LOG'])
nonMLdf['statistics.viewCount_REG'] = perform_reg(nonMLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), nonMLdf['statistics.viewCount_LOG'])

plt.figure(figsize=(15,10))
sns.set(style='darkgrid')

ax = sns.scatterplot(data=MLdf,x='snippet.publishedAt_REFORMATED', y='statistics.viewCount_LOG')
ax = sns.scatterplot(data=nonMLdf,x='snippet.publishedAt_REFORMATED', y='statistics.viewCount_LOG')
ax = sns.lineplot(data=MLdf,x='snippet.publishedAt_REFORMATED', y='statistics.viewCount_REG')
ax = sns.lineplot(data=nonMLdf,x='snippet.publishedAt_REFORMATED', y='statistics.viewCount_REG')

ax.set_title("Time Series of YouTube Videos", fontsize=30)
ax.set_xlabel('Dates Published', fontsize=20)
ax.set_ylabel('Natural Log of View Count', fontsize=20)

plt.legend(d)

plt.show()

Voilà! We can see the general increase in views of Daniel Bourke’s channel as he gains more subscribers. We can also see the hard work he puts into his channel over the years, and his consistency, which is key.

We can also see that, roughly, the machine learning videos tend to perform better than those that are not. The slope quantifies this, and the Pearson correlation coefficient tells us how much weight we can put on that trend.

import scipy.stats

# calculating the slope
resultsML = scipy.stats.linregress(MLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), MLdf['statistics.viewCount_LOG'])
resultsNonML = scipy.stats.linregress(nonMLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), nonMLdf['statistics.viewCount_LOG'])

print('ML video slope: '  + str(resultsML.slope).rjust(45))
print('Non-ML related video slope: ' + str(resultsNonML.slope).rjust(33))
print(' ')

ML_r = scipy.stats.pearsonr(MLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), MLdf['statistics.viewCount_LOG'])[0]
nonML_r = scipy.stats.pearsonr(nonMLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), nonMLdf['statistics.viewCount_LOG'])[0]

# printing the correlation coefficients
print('ML video correlation coefficient: '  + str(ML_r).rjust(24))
print('Non-ML video correlation coefficient: ' + str(nonML_r).rjust(20))

And here we have the output:

ML video slope:                         0.0027619171204583703
Non-ML related video slope:             0.0015117478413339974
 
ML video correlation coefficient:       0.5593699023261571
Non-ML video correlation coefficient:   0.4881345759602963

We can see the slope for machine learning videos is greater than for the non-machine learning content. Please excuse the small numbers: time is converted into days (ordinal dates), which are large numbers compared to the logarithm of view count, so the slopes come out tiny.
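
To put the slope in perspective, here is a back-of-the-envelope reading (my own arithmetic, not from the analysis above):

slope_ML = 0.0027619171204583703         # log-views per day, from the output above
print(round(np.exp(slope_ML * 365), 2))  # ~2.74: the fitted view count grows roughly 2.7x per year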

Another thing to note is the correlation coefficients, which indicate the data is pretty random or noisy. There is a correlation, albeit quite a weak one.

Final Notes

It is safe to say that Daniel Bourke’s machine learning videos are more popular compared to those about health and fitness. After all, this is what he is known for.

On the other hand, should he focus on just machine learning videos for the views? I hope not. YouTubers too often give in to what their audience wants instead of publishing what they want to.

Health and fitness is what keeps Daniel Bourke unique. Additionally, the message he spreads is a positive one. In this case, the numbers do not really matter. If he helps one person, that in itself is a victory.

What next?

This is my first data-related project and my knowledge is sparse, so this is more of a statistical analysis than training and implementing a model. But beginnings are humbling.

In the future, as my knowledge hopefully grows, I would like to implement a better way of classifying YouTube videos than a regex match. One idea would be combining the video title, description, tags, and comments, and then using clustering to pick up more topics than just machine learning and non-machine learning.

Another limitation, down to my skill, is the lack of statistical tests. We can easily see a difference between the video types, but is it statistically significant? For the moment I will retire this project, but it can always be dusted off in the future.
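
As a pointer for future work, a non-parametric test such as the Mann–Whitney U test would be one way to check; here is a minimal sketch (my suggestion, not something the analysis above actually ran):

from scipy import stats

# Compare the raw view counts of the two groups without assuming normality
u_stat, p_value = stats.mannwhitneyu(MLds.values, nonMLds.values, alternative='two-sided')
print('U = {}, p = {:.4f}'.format(u_stat, p_value))  # p < 0.05 would suggest a real difference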

Conclusion

In this post, we went over an analysis of Daniel Bourke’s YouTube videos. The question was how his machine learning related videos compare to his other content.

First, we had to classify the YouTube videos into machine learning and non-machine learning videos.

A box plot showed a difference in view count between the two groups. A logarithm had to be applied for this to make sense, since views have a ‘viral’ effect.

Finally, we plotted the views against publish date and used regression to see the growth over time.

Thank you for reading. I am open to any feedback, so please comment below if you have any tips.

Credit has to go to Daniel Bourke for giving me full permission to go all out with his YouTube data.

