My TubeStats App (ft. Ali Abdaal)


With over 100 million views and one million subscribers, Ali Abdaal is well placed to advise on building a successful YouTube channel. Besides telling aspiring YouTubers to start today, he also preaches consistency.

That is where the motivation for this app comes from. The question: how consistent is Ali Abdaal’s YouTube channel? And how do other YouTube channels compare?

We will see how TubeStats answers this question and walk through how it works. I’ll also share my personal learnings as well as what I plan to do in the future.

Link to app: https://www.tubestats.app

GitHub: https://www.github.com/shivans93/tubestats

What does TubeStats do?

The app takes user input. This can be:

  • a channel ID (e.g. UCoOae5nYA7VqaXzerajD0lg),
  • a link to the channel (e.g. https://www.youtube.com/channel/UCoOae5nYA7VqaXzerajD0lg),
  • a link to a video from the channel of interest (e.g. https://www.youtube.com/watch?v=epF2SYpWtos), or
  • a video ID (e.g. epF2SYpWtos).

After a few moments, it will produce summary statistics for the channel. Examples include total video count and total watch time as shown below.

[Screenshot: TubeStats summary statistics for Ali Abdaal’s channel]

Here, we also have the channel avatar image as well as the channel description.

Next, we have a graph that summarises all the videos in the channel.

The plot shows time of posting against the natural logarithm of view count. The natural logarithm is used because the data is heavily skewed towards the tail, owing to the ‘viral’ nature of videos.

The colour of the circles represents the like-dislike ratio of each video. The size of the circles is related to the absolute view count.
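As a side note, here is a minimal sketch of how the derived columns used by the plot (statistics.viewCount_NLOG and statistics.like-dislike-ratio, which appear in the plotting code later in this post) might be computed; the exact ratio formula here is an assumption:

import numpy as np

# natural log of views, to compress the long 'viral' tail
df['statistics.viewCount_NLOG'] = np.log(df['statistics.viewCount'].astype(int))

# likes over likes-plus-dislikes, used for the colour scale
# (assumed formula; the original may compute this differently)
likes = df['statistics.likeCount'].astype(int)
dislikes = df['statistics.dislikeCount'].astype(int)
df['statistics.like-dislike-ratio'] = likes / (likes + dislikes)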

The date range can be altered. In this example, I’ve altered the date range because Ali Abdaal started taking his channel more seriously in July 2017.

Another graph shows the number of days between videos. For example, in the dates selected above, the longest period of inactivity was just under 22 days with the “My Favourite iPad Pro Apps” video. From the video’s date, this was around the festive season.

On average, Ali Abdaal is able to put out a video every 3 days, with the majority (75%) released within 5 days of the previous one.
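For the curious, such gap statistics can be computed in a few lines of pandas (a sketch, assuming a snippet.publishedAt column from the flattened API response):

import pandas as pd

# days elapsed between consecutive uploads
dates = pd.to_datetime(df['snippet.publishedAt']).sort_values()
gaps = dates.diff().dt.days.dropna()

print(gaps.mean())          # average days between videos
print(gaps.quantile(0.75))  # 75% of videos released within this many days
print(gaps.max())           # longest period of inactivity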

At the end, TubeStats provides a list of the most-viewed videos in the channel, along with the videos with the lowest like-dislike ratio.
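This is a straightforward sort on the dataframe. A minimal sketch, assuming the flattened column names used elsewhere in this post:

# views arrive as strings from the API, so convert before sorting
df['statistics.viewCount'] = df['statistics.viewCount'].astype(int)

# ten most-viewed videos
most_viewed = df.sort_values('statistics.viewCount', ascending=False).head(10)

# ten videos with the lowest like-dislike ratio
most_disliked = df.sort_values('statistics.like-dislike-ratio').head(10)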

How does TubeStats work?

The major working parts:

  1. YouTube Data API
  2. pandas
  3. streamlit
  4. Heroku

I’ve divided this into hurdles and provided my solutions for overcoming them.

Hurdle #1: how do I set up my development environment?

I’m using Linux. Setting up our development environment is essential for our productivity in coding.

We start by creating our directory. Following this, we create a virtual environment to manage our packages without interfering with other projects. Finally, we initialise git to track changes and allow ease of sharing our code to the world.

$ mkdir tubestats
$ cd tubestats

$ python3 -m venv venv
$ source venv/bin/activate
$ (venv)

$ git init

Hurdle #2: how do we access the video information?

We could scrape every single video page for its view count, comment count, and so on, but this would take a long time. Fortunately, we can use the YouTube Data API, which provides easy access to statistics for YouTube videos and channels. To access it, we must set up a project on the Google Cloud Console.

From this, we create a ‘New Project’, activate the YouTube Data API v3, and finally create credentials. These steps provide a key to access the YouTube Data API.

Hurdle #3: how do we store passwords and API keys?

We want to share our code but not our passwords and API keys. How do we do this?

We can use a third party package: python-dotenv.

This is important so we can access the key locally. When we push to the web with Heroku, we will have to use a different method.

We can store the key in a .env file, and we can add this file to the .gitignore so it doesn’t get shared.

.env file
 APIKey=xxxAPIKEYherexxx
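
Loading the key back into our code then takes a couple of lines (a minimal sketch using python-dotenv; the variable name matches the .env file above):

import os
from dotenv import load_dotenv

load_dotenv()                  # reads key-value pairs from the .env file
API_KEY = os.getenv('APIKey')  # the key never appears in source control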

We install the module that allows access to the YouTube API. We do this using pip, with the virtual environment active.

$ (venv) pip install google-api-python-client
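
With the package installed and the key loaded, the client can be built via googleapiclient’s discovery module (a sketch; the create_api() name mirrors the function tested later in this post, though its exact body is an assumption):

import os
from dotenv import load_dotenv
from googleapiclient.discovery import build

load_dotenv()

def create_api():
    # build a Resource object for the YouTube Data API v3 using our key
    return build('youtube', 'v3', developerKey=os.getenv('APIKey'))

youtube = create_api()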

Hurdle #4: how do we get the video statistics?

If we have a channel ID, we can use it to obtain the ‘uploads’ playlist, which lists the IDs of all videos uploaded to the channel. We can then take each video ID and obtain the statistics we are interested in.
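For reference, the uploads playlist ID can be looked up through the channels endpoint (a hedged sketch, assuming the youtube client built earlier and a channel_ID string; the upload_playlist_ID name matches the variable used below):

# look up the channel's 'uploads' playlist, which contains every video
channel_response = youtube.channels().list(
        part='contentDetails',
        id=channel_ID,
        ).execute()
upload_playlist_ID = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']

With upload_playlist_ID in hand, the method below pages through that playlist.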

def get_video_data(self) -> pd.core.frame.DataFrame:

    ...

    video_response = []
    next_page_token = None
    while True:
        # obtaining video IDs + titles
        playlist_request = self.youtube.playlistItems().list(
                part='snippet,contentDetails',
                maxResults=50, # API limit is 50
                pageToken=next_page_token,
                playlistId=upload_playlist_ID,
                )
        playlist_response = playlist_request.execute()
        # isolating the video IDs
        vid_subset = [vid_ID['contentDetails']['videoId'] for vid_ID in playlist_response['items']]
        # retrieving statistics for this batch of video IDs
        vid_info_subset_request = self.youtube.videos().list(
                part='snippet,contentDetails,statistics',
                id=','.join(vid_subset), # comma-separated list of up to 50 IDs
                )
        vid_info_subset_response = vid_info_subset_request.execute()
        video_response.append(vid_info_subset_response)
        # obtaining the next page token
        next_page_token = playlist_response.get('nextPageToken')
        # get() is used because the token may not exist
        if next_page_token is None:
            break
    df = pd.json_normalize(video_response, 'items')
    return df

Here we call the API, but we can only get 50 video IDs at a time because of pagination. Every time a call is made, a page token is provided if there are more than 50 videos; the page token points to the next page. If the page token returned is None, all IDs have been exhausted. We use a while loop to obtain the video IDs, stopping once there are no more page tokens.

We append each response to a Python list of JSON-like dictionaries. We then store this information in a pandas DataFrame, thanks to pandas’ json_normalize() function. From this, we have our dataframe.
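To illustrate what json_normalize() does with the ‘items’ record path, here is a toy example whose field names mimic the shape of the API response:

import pandas as pd

response = {'items': [
    {'snippet': {'title': 'Video A'}, 'statistics': {'viewCount': '1000'}},
    {'snippet': {'title': 'Video B'}, 'statistics': {'viewCount': '250'}},
]}

# nested keys become dotted column names: 'snippet.title', 'statistics.viewCount'
df = pd.json_normalize([response], 'items')
print(df.columns.tolist())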

Hurdle #5: how to organise the code?

Now that our code is starting to take shape, it’s becoming hard to fit it all into one file. This is where we use separate files and directories for organisation.

├── data
│   ├── channel_data.pkl
│   └── video_data.pkl
├── LICENSE
├── Procfile
├── README.MD
├── requirements.txt
├── setup.sh
├── tests
│   ├── __init__.py
│   ├── test_settings.py
│   ├── test_youtube_api.py
│   ├── test_youtube_data.py
│   └── test_youtube_parser.py
├── tubestats
│   ├── __init__.py
│   ├── youtube_api.py
│   ├── youtube_data.py
│   └── youtube_parser.py
└── youtube_presenter.py

The main sub-directories to note are tubestats, which contains the Python source code to access the API, wrangle the data, and produce the underlying graphs, and tests, which contains the test code for the tubestats module.

Finally, youtube_presenter.py is what presents the app. The other files will be addressed later on.

Hurdle #6: how to test code?

It is important to ensure that our code works. In this case, I use pytest. Here is an example of testing the above get_video_data() function.

 from tubestats.youtube_api import create_api, YouTubeAPI
 from tests.test_settings import set_channel_ID_test_case

 from pathlib import Path

 import pytest

 import googleapiclient
 import pandas

 def test_create_api():
     youtube = create_api()
     assert isinstance(youtube, googleapiclient.discovery.Resource)

 @pytest.fixture()
 def youtubeapi():
     channel_ID = set_channel_ID_test_case()
     yt = YouTubeAPI(channel_ID)
     return yt

 def test_get_video_data(youtubeapi):
     df = youtubeapi.get_video_data()
     assert isinstance(df, pandas.core.frame.DataFrame)
 
     # saving video data to save API calls for later test 
     BASE_DIR = Path(__file__).parent.parent
     df.to_pickle(BASE_DIR / 'data' / 'video_data.pkl')

In this case, we import our source code along with the associated modules. We use the pytest.fixture() decorator, which allows us to reuse the data pulled in one test case across others. We ‘pickle’ the data so we can use it instead of making more API calls in later tests.

I can probably do better than isinstance, but this will have to do.

If we run pytest in the terminal, this will test the functionality of our code.

Hurdle #7: how to display this data for others to interact with?

This is done through a combination of altair, which provides interactive graphs, and streamlit, which displays them and handles the interaction.

The following code creates a graph that displays all the videos over time.

import altair as alt

def scatter_all_videos(self, df: pd.core.frame.DataFrame) -> alt.vegalite.v4.Chart:
    df_views = df
    c = alt.Chart(df_views, title='Plot of videos over time').mark_point().encode(
            x=alt.X('snippet.publishedAt_REFORMATED:T', axis=alt.Axis(title='Date Published')),
            y=alt.Y('statistics.viewCount_NLOG:Q', axis=alt.Axis(title='Natural Log of Views')),
            color=alt.Color('statistics.like-dislike-ratio:Q', scale=alt.Scale(scheme='turbo'), legend=None),
            tooltip=['snippet.title:N', 'statistics.viewCount:Q', 'statistics.like-dislike-ratio:Q'],
            size=alt.Size('statistics.viewCount:Q', legend=None)
    )
    return c

Next, we can display this graph with the ability to edit dates.

import streamlit as st
from datetime import datetime, timedelta

def date_slider(date_end=datetime.today()):
    date_start, date_end = st.slider(
            'Select date range to include:',
            min_value=first_video_date, # date of the first video
            max_value=last_video_date, # date of the most recent video
            value=(first_video_date, last_video_date), # default to the full range
            step=timedelta(days=2),
            format='YYYY-MM-DD',
            key=999)
    return date_start, date_end

date_start, date_end = date_slider()
transformed_df = youtuber_data.transform_dataframe(date_start=date_start, date_end=date_end)
c = youtuber_data.scatter_all_videos(transformed_df)
st.altair_chart(c, use_container_width=True)

This is what we see when running the program locally with streamlit run youtube_presenter.py.

Hurdle #8: how do I show off?

There is no point in this sitting on our computers. We need to share it with the world. GitHub can be used to share code, but what about the non-technical among us?

The solution is to use Heroku to host.

First, we create an account on Heroku. A free account will suffice.

We have ensured the code works and pushed it to GitHub. Next, we create a requirements.txt, a setup.sh, and a Procfile.

The requirements.txt file contains our libraries. We can generate it with:

$ (venv) pip freeze > requirements.txt

The setup.sh file is created:

mkdir -p ~/.streamlit/

echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml

Then the Procfile tells Heroku that this is a web application. It instructs Heroku to run setup.sh and then start our code using streamlit.

web: sh setup.sh && streamlit run youtube_presenter.py

Next, we install the Heroku Command Line Interface.

We can then push our code to Heroku just like we would to a GitHub repository.

$ heroku login
$ heroku create tubestats

$ git push heroku main

And here we are, the final product.

We can also attach a domain to customise the app even more.

Hurdle #9: what if a video ID is inputted?

So far, this only works with a channel ID as input. What about a video ID, or a URL? This is where we need to parse the input.

We use a combination of regex and checking lengths.

For example, a channel ID is exactly 24 characters long, while a video ID is 11 characters long.
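A minimal sketch of that length check (the function name and the ‘UC’ prefix test are illustrative assumptions, not the exact TubeStats code); links fall through to the regex step described next:

def classify_input(user_input: str) -> str:
    # channel IDs are 24 characters and conventionally start with 'UC'
    if len(user_input) == 24 and user_input.startswith('UC'):
        return 'channel_id'
    # video IDs are 11 characters
    if len(user_input) == 11:
        return 'video_id'
    # anything else is treated as a link and parsed with regex
    return 'link'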

If a link is provided, we can use regular expressions (regex, via the re module) to extract the video ID or channel ID. There are plenty of websites that help with building regex.

Here is an example of applying regex.

import re

LINK_MATCH = r'(^.*youtu)(\.be|be\.com)(\/watch\?v\=|\/)([a-zA-Z0-9_-]+)(\/)?([a-zA-Z0-9_-]+)?'
m = re.search(LINK_MATCH, for_parse)
video_id = m.group(4) # video ID
if video_id == 'channel':
    return m.group(6) # channel ID
elif video_id == 'user':
    channel_username = m.group(6) # channel username
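
To trace through it: for a watch URL like https://www.youtube.com/watch?v=epF2SYpWtos, group 4 captures the video ID itself. For a channel URL like https://www.youtube.com/channel/UCoOae5nYA7VqaXzerajD0lg, group 4 captures the literal word ‘channel’ and group 6 captures the channel ID, which is why the code inspects group 4 before deciding what to return.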

What did I learn from TubeStats?

Lots can be learned from this project.

Projects can aid with learning and keep you interested. Nothing helps more than having a question and answering it. This means all your research and what you learn is relevant and likely to stick.

Also, it is important to get to a ‘minimum viable product’. If a project is overwhelming, break it down to a simple level to get something that just works. Then, focus on making the code performant, clean etc.

What can I work on in the future?

Errors. There is currently no error handling; whatever errors the code raises are what gets printed.

Better performance. There are probably lots of areas where the code could run better, but I can come back to this later.

Async. I was playing around with some async libraries like aioyoutube (https://github.com/im-mde/aioyoutube.py) and ytpy. But, as with the previous point, I’ll come to this later as it may require major refactoring.

Conclusion

Consistency in posting videos is a major component of success for a YouTube channel. This app measures exactly that, or at least tries to.

The project uses the YouTube Data API along with pandas and streamlit, and is pushed to Heroku for hosting. An important lesson is building a project to a stage that is minimally viable, and focusing on perfecting it later.

I’ve taken a lot from this project, and there is more I can work on.

References

  1. Ali Abdaal’s website: https://aliabdaal.com/
  2. YouTube Data API: https://developers.google.com/youtube/v3/
  3. Pandas documentation: https://pandas.pydata.org/
  4. Pytest: https://docs.pytest.org/en/6.2.x/
  5. Altair: https://altair-viz.github.io/
  6. Streamlit: https://streamlit.io/
  7. Heroku: https://www.heroku.com
  8. Deploying a streamlit app onto Heroku: https://towardsdatascience.com/a-quick-tutorial-on-how-to-deploy-your-streamlit-app-to-heroku-874e1250dadd
