Welcome back!
This is part 2 of a multi-part post, where we look into Daniel Bourke’s YouTube channel.
The question we are asking: how popular are Daniel Bourke’s machine learning related videos compared to the rest of his content?
In the first post, we used the YouTube Data API v3 to pull statistics for every video he has uploaded.
We have this data sitting in a pandas dataframe ready for us to work on.

We are going to work on this data and build some visualisations, like the one below, to answer this question.

I have also included this on GitHub.
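Before we start, here are the imports the rest of this post assumes (a minimal setup carried over from part 1; your aliases may differ):
from datetime import datetime
import re
import isodate # parses ISO 8601 durations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns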
Exploring the Data
We have the dataframe; now it’s time to explore the different columns using df.dtypes. Here is the output:
kind object
etag object
id object
snippet.publishedAt object
snippet.channelId object
snippet.title object
snippet.description object
snippet.thumbnails.default.url object
snippet.thumbnails.default.width int64
snippet.thumbnails.default.height int64
snippet.thumbnails.medium.url object
snippet.thumbnails.medium.width int64
snippet.thumbnails.medium.height int64
snippet.thumbnails.high.url object
snippet.thumbnails.high.width int64
snippet.thumbnails.high.height int64
snippet.thumbnails.standard.url object
snippet.thumbnails.standard.width float64
snippet.thumbnails.standard.height float64
snippet.thumbnails.maxres.url object
snippet.thumbnails.maxres.width float64
snippet.thumbnails.maxres.height float64
snippet.channelTitle object
snippet.categoryId object
snippet.liveBroadcastContent object
snippet.localized.title object
snippet.localized.description object
contentDetails.duration object
contentDetails.dimension object
contentDetails.definition object
contentDetails.caption object
contentDetails.licensedContent bool
contentDetails.projection object
statistics.viewCount object
statistics.likeCount object
statistics.dislikeCount object
statistics.favoriteCount object
statistics.commentCount object
snippet.tags object
snippet.defaultAudioLanguage object
contentDetails.regionRestriction.blocked object
dtype: object
The next segment of code selects the relevant columns. We will also convert the data types to something we can play around with using astype().
df1 = df[['snippet.title', 'snippet.tags', 'contentDetails.duration',
'statistics.viewCount', 'statistics.likeCount', 'statistics.dislikeCount',
'statistics.commentCount', 'snippet.publishedAt', 'snippet.description']]
df1 = df1.astype({'statistics.viewCount': 'float',
'statistics.likeCount': 'float',
'statistics.dislikeCount': 'float',
'statistics.commentCount': 'float'}) # one astype() call can convert several columns at once
Furthermore, there is time data that also needs to be converted to a workable format.
df1['snippet.publishedAt_REFORMATED'] = df1['snippet.publishedAt'].apply(lambda x : datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ'))
df1['contentDetails.duration_REFORMATED'] = df1['contentDetails.duration'].apply(lambda x : isodate.parse_duration(x)) # turns an ISO 8601 duration string into a timedelta Python can work with
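To get a feel for what these conversions produce, here is a quick check on sample values (my own examples, not from the dataset):
print(datetime.strptime('2019-05-21T07:30:00Z', '%Y-%m-%dT%H:%M:%SZ')) # 2019-05-21 07:30:00
print(isodate.parse_duration('PT4M13S')) # 0:04:13, a timedelta of 4 minutes and 13 seconds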
Now we can check on our newly converted columns with df1.dtypes.
snippet.title object
snippet.tags object
contentDetails.duration object
statistics.viewCount float64
statistics.likeCount float64
statistics.dislikeCount float64
statistics.commentCount float64
snippet.publishedAt object
snippet.description object
snippet.publishedAt_REFORMATED datetime64[ns]
contentDetails.duration_REFORMATED timedelta64[ns]
dtype: object
Let’s explore the distribution of view counts using the matplotlib library. We are also going to use the seaborn library, which makes the graph more appealing.
views = df1['statistics.viewCount']
sns.set(style='darkgrid')
plt.figure(figsize=(15,10))
ax = sns.distplot(views, kde=False)
ax.set(title='Distribution of View Counts', xlabel='View Count', ylabel='Number of Videos')

This graph here shows how difficult it is to be a content creator. The large majority of Daniel Bourke’s videos do not make it over the 10,000 view mark.
We can further confirm this.
vidsUnder10000views = (df1['statistics.viewCount'] < 10000).sum()
totalVids = len(listOfVideo_IDs)
percentage = round((vidsUnder10000views / totalVids) * 100, 2)
print('Videos under 10,000 views: {} ({}%)'.format(vidsUnder10000views, percentage))
Output:
Videos under 10,000 views: 217 (86.11%)
Still, videos that get close to 10,000 views are by no means a ‘flop’. Here we can see the distribution of the videos below the 10,000 view mark, if we are really bored.
viewsU10000 = df1['statistics.viewCount'][df1['statistics.viewCount']<10000]
sns.set(style='darkgrid')
plt.figure(figsize=(15,10))
ax = sns.distplot(viewsU10000, kde=False)
ax.set(title='Videos under 10,000 views',xlabel='View Count', ylabel='Number of Videos')

Classifying the Videos
We are getting a little carried away.
Now we need a way to classify the videos into machine learning and not machine learning videos. Let’s have a quick look at the video tags.
df['snippet.tags'].head()
And here is the output:
0 NaN
1 [self supervised learning machine learning, ma...
2 NaN
3 [ken jee, daniel bourke, what questions get as...
4 [machine learning field guide, machine learnin...
Name: snippet.tags, dtype: object
Most of the tags are in lists. Some are empty or NaN.
We can expand the lists and count the most frequent tags. The way we do this is by using apply(pd.Series), which expands the tags out into the columns of a dataframe. Next we use stack() to collapse those columns into a single series, with each tag on its own row. reset_index(), as its name suggests, resets the index; passing drop=True prevents the previous index from becoming a column. Finally, we use value_counts() to count the number of times each tag occurs.
tags = df['snippet.tags'].apply(pd.Series).stack().reset_index(drop=True).value_counts()
tagsAbove10 = tags[lambda x : x > 10]
tagsAbove10
This outputs:
daniel bourke 95
fitness 58
machine learning engineer 56
learning 52
life 51
vlog 51
podcast 51
lessons 49
machine 49
entertainment 49
Daniel Bourke 48
The Daniel Bourke Show 47
college 47
university 47
success 47
fail 47
The 46
Daniel 46
public 46
Bourke 46
drop out 46
speaking 46
Show 46
machine learning 44
mrdbourke 34
artificial intelligence 29
udacity 24
deep learning 23
code 19
data science 17
python 16
coursera 16
coding 16
100 days of code 16
udacity nanodegree 16
Udacity Artificial Intelligence Nanodegree 13
AI 13
nanodegree 13
data scientist 12
udacity deep learning nanodegree 12
ai 11
dtype: int64
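If the chaining is hard to picture, here is the same pipeline on a tiny made-up series (my own toy data; stack() silently drops the NaN entries along the way):
s = pd.Series([['a', 'b'], ['b'], float('nan')])
print(s.apply(pd.Series).stack().reset_index(drop=True).value_counts())
# b    2
# a    1
# dtype: int64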
Putting this all together, we get a series of tags in descending order of frequency, which we can graph.
sns.set(style='darkgrid')
plt.figure(figsize=(15,10))
ax = sns.barplot(tagsAbove10.index, tagsAbove10.values)
ax.set(title='Tags',xlabel='Tags', ylabel='Number of mentions')
ticks = ax.set_xticklabels(ax.get_xticklabels(),rotation=80) # if not assigned to a variable, then prints lots of text

By visualising the tags, we can create a list of words that, if present in the title or in the tags, will deem the video a ‘machine learning topic’.
pat = r'.*(machine learning|data science|artificial intelligence|deep learning|machine|python|coding|code|udacity|coursera|data scientist|tensorflow).*$'
df1['isMachineLearning'] = df1['snippet.tags'].map(lambda x : bool(re.search(pat, str(x).lower()))) | df1['snippet.title'].map(lambda x : bool(re.search(pat, str(x).lower())))
Here we’ve got a really long regular expression string, which checks whether any of the above words are present in the video’s tags or title. It uses a combination of a lambda function, which creates a one-line function, and the map() function, which applies that lambda to every cell. re.search() looks for a match, bool() ensures True or False is returned, and lower() is applied to avoid mismatches between upper and lower case. The result goes into a new column within the same dataframe called isMachineLearning.
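To sanity-check the pattern, we can try it on a couple of made-up titles (my own examples, not from the dataset):
print(bool(re.search(pat, '5 Beginner Python Tips'.lower()))) # True, matches 'python'
print(bool(re.search(pat, 'My Morning Routine'.lower()))) # False, no keyword present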
After all this, we can check how many NaNs we have.
NaNs = df1['isMachineLearning'].isna().sum()
Total_MLvids = df1['isMachineLearning'].sum()
Total_nonMLvids = (df1['isMachineLearning']==False).sum()
Total = Total_MLvids + Total_nonMLvids
print('Total NaN: ', NaNs)
print('Total ML videos: ', Total_MLvids)
print('Total non-ML videos: ', Total_nonMLvids)
print('Videos in total: ', Total)
print('Percentage of ML videos: {}%'.format(round(Total_MLvids / Total * 100, 2)))
Here is the output:
Total NaN: 0
Total ML videos: 174
Total non-ML videos: 78
Videos in total: 252
Percentage of ML videos: 69.05%
A large majority of the videos Daniel Bourke makes are related to machine learning. However, this assumes the method we used to classify the videos works perfectly (it probably doesn’t, and there is probably a better way to do it, so please comment with your approach).
Visualising the Difference
Let’s plot these two groups in a box plot and see if we can build a meaningful visualisation. I’ve also added some lines of code to make the title and axis labels a bit easier to see.
MLds = df1[df1['isMachineLearning']==True]['statistics.viewCount'] # create dataseries of view counts for ML related videos
nonMLds = df1[df1['isMachineLearning']==False]['statistics.viewCount']
d = {'ML related': MLds.values, 'non-ML related': nonMLds.values} # a dictionary to label the data
d = pd.DataFrame.from_dict(d, orient='index') # the series have different lengths, so pandas pads the shorter one with NaN
data = d.transpose()
plt.figure(figsize=(15,10))
sns.set(style='darkgrid')
ax = sns.boxplot(data=data, orient='h')
ax.set_title("Daniel Bourke's YouTube Comparison", fontsize=30)
ax.set_xlabel('View Count', fontsize=20)
ax.set_ylabel('Video Type', fontsize=20)
plt.show()

Hmmm… it’s really hard to see what is going on here because the data is compressed to one end. As we found out before, most videos aren’t going to make the 10,000 view count, so most sit at the lower end while some videos accumulate a lot more views. A lot more!
This is because view count has an exponential pattern: a ‘viral’ effect. This means we need a way of linearising the data, which we can do by applying a logarithm with np.log(). I’m going to use the same code as above but change two lines: adding the logarithm and changing the axis label. We are also going to add a swarmplot, which represents each video as a dot.
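To illustrate why the logarithm helps, notice how it compresses numbers spanning several orders of magnitude into a comparable range (illustrative values, not from the data):
print(np.log([100, 10000, 1000000])) # roughly [4.61, 9.21, 13.82]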
d = {'ML related': np.log(MLds.values), 'non-ML related': np.log(nonMLds.values)} # applying the natural log
d = pd.DataFrame.from_dict(d, orient='index')
data = d.transpose()
plt.figure(figsize=(15,10))
sns.set(style='darkgrid')
ax = sns.boxplot(data=data, orient='h')
ax = sns.swarmplot(data=data, orient='h', color='0.4') # overlays a dot for each video
ax.set_title("Daniel Bourke's YouTube Comparison", fontsize=30)
ax.set_xlabel('Natural Log of View Count', fontsize=20)
ax.set_ylabel('Video Type', fontsize=20)
plt.show()

Much better, don’t you think?
We can see that the machine learning related videos, though they have a bigger spread, on average have more views.
Finally, another graph to visualise this difference is a histogram, or distribution plot. The two groups have different numbers of videos, so the next plot normalises the data.
d = {'ML related': np.log(MLds.values), 'non-ML related': np.log(nonMLds.values)}
d = pd.DataFrame.from_dict(d, orient='index')
data = d.transpose()
plt.figure(figsize=(15,10))
sns.set(style='darkgrid')
ax = sns.distplot(data['ML related'])
ax = sns.distplot(data['non-ML related'])
ax.set_title("Daniel Bourke's YouTube Comparison", fontsize=30)
ax.set_xlabel('Natural Log of View Count', fontsize=20)
ax.set_ylabel('Proportion', fontsize=20)
plt.legend(data)
plt.show()

An interesting fact is that Daniel Bourke’s most popular video is not a machine learning one.
df1.sort_values(by='statistics.viewCount', ascending=False).head(2)

The final graph I want to use is a time-series regression, to show how videos from the two groups fare over time.
We are going to plot the view counts over their publish dates, then use the sklearn library to produce a linear regression, or best-fit line.
from sklearn.linear_model import LinearRegression

def perform_reg(X, Y):
    model = LinearRegression()
    X = X.values.reshape(-1, 1) # sklearn expects a 2D array of features
    model.fit(X, Y)
    return model.predict(X)

df1['statistics.viewCount_LOG'] = df1['statistics.viewCount'].apply(lambda x : np.log(x))
MLdf = df1[df1['isMachineLearning']==True].copy() # .copy() avoids pandas' SettingWithCopyWarning
nonMLdf = df1[df1['isMachineLearning']==False].copy()
d = {'ML related': MLdf, 'non-ML related': nonMLdf} # used to label the legend below
MLdf['statistics.viewCount_REG'] = perform_reg(MLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), MLdf['statistics.viewCount_LOG'])
nonMLdf['statistics.viewCount_REG'] = perform_reg(nonMLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), nonMLdf['statistics.viewCount_LOG'])
plt.figure(figsize=(15,10))
sns.set(style='darkgrid')
ax = sns.scatterplot(data=MLdf,x='snippet.publishedAt_REFORMATED', y='statistics.viewCount_LOG')
ax = sns.scatterplot(data=nonMLdf,x='snippet.publishedAt_REFORMATED', y='statistics.viewCount_LOG')
ax = sns.lineplot(data=MLdf,x='snippet.publishedAt_REFORMATED', y='statistics.viewCount_REG')
ax = sns.lineplot(data=nonMLdf,x='snippet.publishedAt_REFORMATED', y='statistics.viewCount_REG')
ax.set_title("Time Series of YouTube Videos", fontsize=30)
ax.set_xlabel('Dates Published', fontsize=20)
ax.set_ylabel('Natural Log of View Count', fontsize=20)
plt.legend(d)
plt.show()

Voilà! We can see the general increase in views on Daniel Bourke’s channel as he gains more subscribers. We can also see the hard work he puts into his channel over the years, and his consistency, which is key.
We can also see that, roughly, the machine learning videos tend to perform better than those that are not. The slope quantifies this, and the Pearson correlation coefficient tells us how much we can rely on that trend.
import scipy.stats
# calculating the slope
resultsML = scipy.stats.linregress(MLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), MLdf['statistics.viewCount_LOG'])
resultsNonML = scipy.stats.linregress(nonMLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), nonMLdf['statistics.viewCount_LOG'])
print('ML video slope: ' + str(resultsML.slope).rjust(45))
print('Non-ML related video slope: ' + str(resultsNonML.slope).rjust(33))
print(' ')
ML_r = scipy.stats.pearsonr(MLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), MLdf['statistics.viewCount_LOG'])[0]
nonML_r = scipy.stats.pearsonr(nonMLdf['snippet.publishedAt_REFORMATED'].map(datetime.toordinal), nonMLdf['statistics.viewCount_LOG'])[0]
# calculating the coefficient
print('ML video correlation coefficient: ' + str(ML_r).rjust(24))
print('Non-ML video correlation coefficient: ' + str(nonML_r).rjust(20))
And here we have the output:
ML video slope: 0.0027619171204583703
Non-ML related video slope: 0.0015117478413339974
ML video correlation coefficient: 0.5593699023261571
Non-ML video correlation coefficient: 0.4881345759602963
We can see the slope for machine learning videos is greater than that for the non-machine learning content. Please excuse the small numbers: time here is measured in ordinal days, which are large numbers compared to the logarithm of view count, so the per-day slopes come out tiny.
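To make the slopes more intuitive, we can convert them from log-views per day into a yearly factor (my own back-of-envelope check using the results above):
print(np.exp(resultsML.slope * 365)) # ~2.74, a video uploaded a year later tends to get ~2.7x the views
print(np.exp(resultsNonML.slope * 365)) # ~1.74 for non-ML videos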
Another thing to note is the correlation coefficients, which indicate the data is pretty noisy. There is a correlation, albeit a fairly weak one.
Final Notes
It is safe to say that Daniel Bourke’s machine learning videos are more popular compared to those about health and fitness. After all, this is what he is known for.
On the other hand, should he focus on just machine learning videos for the views? I hope not. YouTubers too often give in to what their audience wants instead of publishing what they want to make.
Health and fitness is what keeps Daniel Bourke unique. Additionally, the message he spreads is a positive one. In this case, the numbers do not really matter. If he helps one person, that in itself is a victory.
What next?
This is my first data-related project, and my knowledge is sparse. This is more of a statistical analysis than training and implementing a model. But beginnings are humbling.
In the future, as my knowledge hopefully grows, I would like to implement a better way of classifying YouTube videos than a regex match. One idea would be combining the video title, description, tags, and comments, and then using clustering to pick up more topics than just machine learning versus non-machine learning.
Another limitation of my skills is the lack of statistical tests. We can easily see a difference between the video types, but is it statistically significant? For now, I will retire this project, but it can always be dusted off in the future; a possible starting point is sketched below.
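If I do revisit it, a simple first test might be a non-parametric comparison of the two view-count groups (a sketch using the MLds and nonMLds series from earlier; not something I ran for this post):
from scipy.stats import mannwhitneyu
stat, p = mannwhitneyu(MLds.values, nonMLds.values, alternative='two-sided') # makes no normality assumption
print('U = {}, p = {}'.format(stat, p)) # p < 0.05 would suggest the difference is not just noise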
Conclusion
In this post, we went over an analysis of Daniel Bourke’s YouTube videos. The question we had was how his machine learning related videos compare to his other content.
First, we had to classify the YouTube videos into machine learning and non-machine learning videos.
A box plot showed a difference in view count between the two groups. A logarithm had to be applied for this to make sense, since views have a ‘viral’ effect.
Finally, we used regression to plot the views against time and see the growth of each group.
Thank you for reading this. I am open to any feedback, so please comment below if you have any tips.
Credit goes to Daniel Bourke for giving me full permission to go all out with his YouTube data.