With over 100 million views and one million subscribers, Ali Abdaal is certainly qualified to advise on building a successful YouTube channel. Besides urging aspiring YouTubers to start today, he also preaches consistency.
And that is where the motivation for this app comes from. The question: how consistent is Ali Abdaal's YouTube channel? And what about other YouTube channels?
We will see how TubeStats answers this question, and go through how it works. I'll also share my personal learnings as well as what I plan to do in the future.
Link to app: https://www.tubestats.app
What does TubeStats do?
The app takes user input. This can be:
- a channel ID
- a link to the channel
- a link to a video from the channel of interest
- a video ID
After a few moments, it will produce summary statistics for the channel. Examples include total video count and total watch time as shown below.
Here, we also have the channel avatar image as well as a description provided on the channel.
Next, we have a graph that summarises all the videos in the channel.
The plot shows the date of posting against the natural logarithm of the view count. The natural logarithm is used because the data is skewed towards the tail, owing to the 'viral' nature of videos.
The colour of each circle represents the like-dislike ratio of the video. The size of each circle is related to its absolute view count.
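To see why the log helps, consider a hypothetical channel where one video goes viral: the raw view counts span three orders of magnitude, but their natural logs sit comfortably on a single axis. (The numbers here are made up purely for illustration.)

```python
import math

# Hypothetical view counts: two typical videos and one 'viral' outlier
views = [1_000, 5_000, 2_000_000]

log_views = [math.log(v) for v in views]
# The raw spread is 2000x, but the log spread is under 8 units,
# so all three videos remain visible on the same axis.
```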
The date range can be altered. In this example, I’ve altered the date range because Ali Abdaal started taking his channel more seriously in July 2017.
Another graph shows the number of days between videos. For example, in the dates selected above, the longest period of inactivity was just under 22 days, ending with the "My Favourite iPad Pro Apps" video. Judging from the video's date, this was around the festive season.
On average, Ali Abdaal is able to put out a video every 3 days, with the majority (75%) being released within 5 days of the previous one.
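The gap statistics boil down to differencing consecutive upload dates. A minimal sketch of the idea (the helper name and the dates are mine, not the app's actual code):

```python
from datetime import date

def days_between(dates):
    """Gaps in days between consecutive uploads (dates assumed sorted)."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]

uploads = [date(2021, 1, 1), date(2021, 1, 4), date(2021, 1, 9)]  # hypothetical
gaps = days_between(uploads)  # one gap per consecutive pair
```

Summary statistics like the mean gap or the 75th percentile then follow directly from this list.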
At the end, TubeStats provides a list of the most-viewed videos in the channel, along with the videos with the lowest like-dislike ratio.
How does TubeStats work?
The major working parts:
- the YouTube Data API, for the raw channel and video statistics
- pandas, for wrangling the data
- altair, for the interactive graphs
- streamlit and Heroku, for serving the app
I’ve also divided this into hurdles and provided my solutions in overcoming them.
Hurdle #1: how do I set up my development environment?
I'm using Linux. Setting up our development environment well is essential for productive coding.
We start by creating our directory. Following this, we create a virtual environment to manage our packages without interfering with other projects. Finally, we initialise git to track changes and allow ease of sharing our code to the world.
```
$ mkdir tubestats
$ cd tubestats
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ git init
```
Hurdle #2: how do we access the video information?
We could scrape every single video page for its view count, comment count, etc., but this would take a long time. Fortunately, we can use the YouTube Data API, which provides easy access to statistics for YouTube videos and channels. To access it, we must set things up on the Google Cloud Console platform.
From there, we create a 'New Project', activate the YouTube Data API v3, and finally create credentials. These steps provide a key to access the YouTube Data API.
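Under the hood, that key is simply attached as a query parameter on the API's REST endpoints. As a rough illustration of what a channels.list call looks like (the channel ID and key here are placeholders, and the helper is mine, not part of any library):

```python
from urllib.parse import urlencode

API_BASE = 'https://www.googleapis.com/youtube/v3'

def channels_url(channel_id: str, api_key: str) -> str:
    """Build the REST URL for a channels.list request."""
    params = {
        'part': 'snippet,contentDetails,statistics',
        'id': channel_id,
        'key': api_key,
    }
    return f'{API_BASE}/channels?' + urlencode(params)

url = channels_url('UCxxxxxxxxxxxxxxxxxxxxxx', 'xxxAPIKEYherexxx')
```

In practice the client library we install below builds and sends these requests for us.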
Hurdle #3: how do we store passwords and API keys?
We want to share our code but not our passwords and API keys. How do we do this?
We can use a third-party package such as python-dotenv. This lets us access the key locally. We will see that when we push to web serving with Heroku, we have to use a different method.
We can store the key in a .env file, and add that file to .gitignore so it doesn't get shared:

```
# .env
APIKey=xxxAPIKEYherexxx
```
We install the module that allows access to the YouTube API. We do this using pip, with the virtual environment active.
```
(venv) $ pip install google-api-python-client
```
Hurdle #4: how do we get the video statistics?
If we have a channel ID, we can use it to obtain the 'uploads' playlist, which contains the IDs of all videos uploaded to the channel. We can then call the API with those video IDs to obtain the statistics we are interested in.
```python
def get_video_data(self) -> pd.core.frame.DataFrame:
    ...
    while True:
        # obtaining video IDs + titles
        playlist_request = self.youtube.playlistItems().list(
                part='snippet,contentDetails',
                maxResults=50,  # API limit is 50
                pageToken=next_page_token,
                playlistId=upload_playlist_ID,
        )
        playlist_response = playlist_request.execute()

        # isolating video IDs
        vid_subset = [vid_ID['contentDetails']['videoId']
                      for vid_ID in playlist_response['items']]

        # retrieving statistics for those video IDs
        vid_info_subset_request = self.youtube.videos().list(
                part='snippet,contentDetails,statistics',
                id=vid_subset,
        )
        vid_info_subset_response = vid_info_subset_request.execute()
        video_response.append(vid_info_subset_response)

        # obtaining the next page token
        next_page_token = playlist_response.get('nextPageToken')  # get() used because the token may not exist
        if next_page_token is None:
            break

    df = pd.json_normalize(video_response, 'items')
    return df
```
Here we call the API, but we can only get 50 video IDs at a time; this is where pagination comes in. Every time a call is made, a page token is provided if there are more than 50 videos, and this token points to the next page. If the page token returned is None, all IDs have been exhausted. We therefore use a while loop to collect the video IDs, stopping once there are no more page tokens.
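The pagination logic is easier to see stripped of the API details. Here is a toy version where a dictionary stands in for the API, mapping each page token to a batch of items plus the next token (all names and IDs below are made up):

```python
def fetch_all(pages):
    """Collect items across pages until no next-page token remains."""
    items, token = [], None
    while True:
        batch, token = pages[token]  # a real call would pass pageToken=token
        items.extend(batch)
        if token is None:  # no 'nextPageToken' in the last response
            break
    return items

# Three 'pages' of fake video IDs
fake_api = {
    None: (['vid1', 'vid2'], 'pageA'),
    'pageA': (['vid3', 'vid4'], 'pageB'),
    'pageB': (['vid5'], None),
}
all_ids = fetch_all(fake_api)
```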
We append each response to a Python list, giving a list of JSON-format records. We then store this information in a DataFrame using pandas' built-in json_normalize() function. From this, we have our dataframe.
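A toy example of what json_normalize() does with this shape of data (the records here are made up and far simpler than a real API response):

```python
import pandas as pd

# Two fake pages of API output, each with an 'items' list of nested records
video_response = [
    {'items': [{'id': 'abc', 'statistics': {'viewCount': '100'}}]},
    {'items': [{'id': 'def', 'statistics': {'viewCount': '250'}}]},
]

# record_path='items' concatenates the items across pages, one row per video;
# nested fields are flattened into dotted column names like 'statistics.viewCount'
df = pd.json_normalize(video_response, 'items')
```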
Hurdle #5: how to organise the code?
Now that our code is starting to take shape, it's becoming hard to fit it all into one file. This is where we use different files and directories for organisation.
```
├── data
│   ├── channel_data.pkl
│   └── video_data.pkl
├── LICENSE
├── Procfile
├── README.MD
├── requirements.txt
├── setup.sh
├── tests
│   ├── __init__.py
│   ├── test_settings.py
│   ├── test_youtube_api.py
│   ├── test_youtube_data.py
│   └── test_youtube_parser.py
├── tubestats
│   ├── __init__.py
│   ├── youtube_api.py
│   ├── youtube_data.py
│   └── youtube_parser.py
└── youtube_presenter.py
```
The main sub-directories to note are tubestats, which contains the Python source code to access the API, wrangle the data, and produce the underlying graphs; and tests, which contains the test code for the tubestats package. youtube_presenter.py is what presents the app. We will address some of the other files later on.
Hurdle #6: how to test code?
It is important to ensure that our code works. In this case, I use pytest. Here is an example of testing the API code above:
```python
from pathlib import Path

import pytest
import googleapiclient
import pandas

from tubestats.youtube_api import create_api, YouTubeAPI
from tests.test_settings import set_channel_ID_test_case


def test_create_api():
    youtube = create_api()
    assert isinstance(youtube, googleapiclient.discovery.Resource)


@pytest.fixture()
def youtubeapi():
    channel_ID = set_channel_ID_test_case()
    yt = YouTubeAPI(channel_ID)
    return yt


def test_get_video_data(youtubeapi):
    df = youtubeapi.get_video_data()
    assert isinstance(df, pandas.core.frame.DataFrame)

    # saving video data to save API calls in later tests
    BASE_DIR = Path(__file__).parent.parent
    df.to_pickle(BASE_DIR / 'data' / 'video_data.pkl')
```
In this case, we import our source code along with the associated modules. We use the pytest.fixture() decorator, which allows us to reuse the results of the data we pull in our test case. We 'pickle' the data so we can use it instead of making more API calls in other tests.
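The pickling itself is just a to_pickle()/read_pickle() round trip; the toy dataframe below stands in for the real API output:

```python
import os
import tempfile

import pandas as pd

# Stand-in for the dataframe returned by get_video_data()
df = pd.DataFrame({'videoId': ['abc', 'def'], 'viewCount': [100, 250]})

path = os.path.join(tempfile.mkdtemp(), 'video_data.pkl')
df.to_pickle(path)                # cache the expensive-to-fetch data to disk
df_cached = pd.read_pickle(path)  # later tests reload it instead of re-calling the API
```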
I can probably do better than isinstance, but this will have to do.
If we run pytest in the terminal, it will test the functionality of our code.
Hurdle #7: how to display this data for others to interact?
This is done through the combination of altair, which provides us with interactive graphs, and streamlit, which displays these graphs and allows interaction.
The following code creates a graph that displays all the videos over time.
```python
import altair as alt

def scatter_all_videos(self, df: pd.core.frame.DataFrame) -> alt.vegalite.v4.Chart:
    df_views = df
    c = alt.Chart(df_views, title='Plot of videos over time').mark_point().encode(
            x=alt.X('snippet.publishedAt_REFORMATED:T', axis=alt.Axis(title='Date Published')),
            y=alt.Y('statistics.viewCount_NLOG:Q', axis=alt.Axis(title='Natural Log of Views')),
            color=alt.Color('statistics.like-dislike-ratio:Q', scale=alt.Scale(scheme='turbo'), legend=None),
            tooltip=['snippet.title:N', 'statistics.viewCount:Q', 'statistics.like-dislike-ratio:Q'],
            size=alt.Size('statistics.viewCount:Q', legend=None),
    )
    return c
```
Next, we can display this graph with the ability to edit dates.
```python
import streamlit as st

def date_slider(date_end=datetime.today()):
    date_start, date_end = st.slider(
            'Select date range to include:',
            min_value=first_video_date,  # first video
            max_value=last_video_date,   # value for date_end
            value=(first_video_date, last_video_date),  # same as min/max values
            step=timedelta(days=2),
            format='YYYY-MM-DD',
            key=999)
    return date_start, date_end

date_start, date_end = date_slider()

transformed_df = youtuber_data.transform_dataframe(date_start=date_start, date_end=date_end)

c = youtuber_data.scatter_all_videos(transformed_df)
st.altair_chart(c, use_container_width=True)
```
This is what we see when running the program locally with streamlit run youtube_presenter.py:
Hurdle #8: how do I show off?
There is no point in this sitting on our computers; we need to share it with the world. GitHub can be used to share code, but what about the non-technical among us?
The solution is to use Heroku to host.
First, we create an account on Heroku; a free account will suffice.
Having ensured the code works, we can push it onto GitHub. Next, we create a requirements.txt file that will contain our libraries. We can use the command:
```
(venv) $ pip freeze > requirements.txt
```
A setup.sh file is created:
```
mkdir -p ~/.streamlit/

echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml
```
The Procfile tells Heroku that this is a web application. It instructs it to run setup.sh and then start our app:

```
web: sh setup.sh && streamlit run app.py
```
Next, we install the Heroku Command Line Interface.
We can then push our code to Heroku just like we would to a GitHub repository.
```
$ heroku login
$ heroku create tubestats
$ git push heroku main
```
And here we are, the final product.
We can also attach a domain to customise the app even more.
Hurdle #9: what if a video ID is inputted?
So far, this only works with a channel ID as input. What about a video ID? A URL? This is where we need to parse the input.
We use a combination of regex and checking lengths.
For example, a channel ID is exactly 24 characters long, while a video ID is 11 characters long.
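A hypothetical sketch of those length checks (classify_input and its labels are mine, not the app's actual code; I've also used the fact that channel IDs start with 'UC'):

```python
def classify_input(user_input: str) -> str:
    """Guess what kind of identifier the user pasted, by its length."""
    if len(user_input) == 24 and user_input.startswith('UC'):
        return 'channel_id'
    if len(user_input) == 11:
        return 'video_id'
    return 'link'  # anything else: try to parse it as a URL
```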
If a link is provided, we can use a regular expression (regex, via the re module) to extract the video ID or channel ID. There are plenty of websites that help with building regexes.
Here is an example of applying regex.
```python
import re

LINK_MATCH = r'(^.*youtu)(\.be|be\.com)(\/watch\?v\=|\/)([a-zA-Z0-9_-]+)(\/)?([a-zA-Z0-9_-]+)?'

m = re.search(LINK_MATCH, for_parse)
video_id = m.group(4)  # video ID

if video_id == 'channel':
    return m.group(6)  # channel ID
elif video_id == 'user':
    channel_username = m.group(6)  # channel username
```
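A quick sanity check of the pattern, wrapped in a hypothetical parse_link() helper (the helper and the sample URLs are mine; only the pattern itself comes from the code above):

```python
import re

LINK_MATCH = r'(^.*youtu)(\.be|be\.com)(\/watch\?v\=|\/)([a-zA-Z0-9_-]+)(\/)?([a-zA-Z0-9_-]+)?'

def parse_link(for_parse: str) -> str:
    """Extract a video or channel identifier from a YouTube URL."""
    m = re.search(LINK_MATCH, for_parse)
    token = m.group(4)
    if token in ('channel', 'user'):
        return m.group(6)  # the ID or username follows the /channel/ or /user/ prefix
    return token           # otherwise group 4 is the video ID itself
```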
What did I learn from TubeStats?
Lots can be learned from this project.
Projects can aid with learning and keep you interested. Nothing helps more than having a question and answering it. This means all your research and what you learn is relevant and likely to stick.
Also, it is important to get to a ‘minimum viable product’. If a project is overwhelming, break it down to a simple level to get something that just works. Then, focus on making the code performant, clean etc.
What can I work on in the future?
Errors. There is currently no way to catch errors gracefully; whatever errors the code itself raises are printed as-is.
Better performance. There are probably lots of areas where the code could be made to run better, but I can come back to this later.
Async. I was playing around with some async libraries like aioyoutube and ytpy. But, as with the previous point, I'll come to this later, as it may require major code refactoring.
Consistency in posting videos is a major component in success in a YouTube channel. This app measures exactly that. Or at least tries.
The project implements the YouTube Data API along with streamlit, and is pushed to Heroku for hosting. An important lesson is building a project to a stage where it is minimally viable, and focusing on perfecting it later.
I've taken a lot from this project, and there is still more I can work on.
- Ali Abdaal’s website: https://aliabdaal.com/
- YouTube Data API: https://developers.google.com/youtube/v3/
- Pandas documentation: https://pandas.pydata.org/
- Pytest: https://docs.pytest.org/en/6.2.x/
- Altair: https://altair-viz.github.io/
- Streamlit: https://streamlit.io/
- Heroku: https://www.heroku.com
- Deploying a streamlit app onto Heroku: https://towardsdatascience.com/a-quick-tutorial-on-how-to-deploy-your-streamlit-app-to-heroku-874e1250dadd