Sunday 20 January 2013

Benford's Law and YouTube Videos

The latest video on Numberphile deals with Benford's Law.

This relates to the distribution of LEADING DIGITS in a collection of numbers.

Basically, it turns out that "1" is the leading digit nearly a third of the time (about 30%), under the right circumstances.

It's all explained very well by Steve Mould here in the Numberphile video.



Of course for even more detail, there's always Wikipedia.

Taking things a bit further, I decided to analyse Numberphile's YouTube figures to see if they obeyed Benford's Law.

(Well actually it was done for me by a chap named Daniel who is far better with spreadsheets!)
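For anyone who would rather script it than use a spreadsheet, here's a minimal sketch in Python of tallying leading digits. This is not how Daniel did it, the function names are my own, and the view counts are invented purely for illustration.

```python
from collections import Counter


def leading_digit(n: int) -> int:
    """Return the first (most significant) digit of a non-zero integer."""
    return int(str(abs(n))[0])


def leading_digit_shares(values):
    """Map each leading digit 1-9 to its share of the positive values.

    Assumes at least one positive value is supplied.
    """
    digits = [leading_digit(v) for v in values if v > 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


# Made-up view counts, NOT real channel data.
example_views = [1523, 18764, 120345, 2390, 987, 14002, 31877, 1045, 76210, 1999]
shares = leading_digit_shares(example_views)

for digit, share in shares.items():
    print(f"{digit}: {share:.0%}")
```

The same function would work for durations in seconds or comment counts, since only the first digit of each number matters.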

Here's the resulting "extra" film, also posted on Numberphile:



And here are the Numberphile viewing figures, charting the videos' durations (in seconds), views and number of comments.



Now, just as a reminder, below is a perfect Benford distribution - the curve is quite obvious!
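The perfect distribution comes from the standard Benford formula, where the chance of leading digit d is log10(1 + 1/d). A small Python sketch to reproduce the numbers (the variable and function names are just my choice):

```python
import math


def benford_probability(d: int) -> float:
    """Expected share of numbers with leading digit d under Benford's Law."""
    return math.log10(1 + 1 / d)


benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

Digit 1 comes out at about 30.1%, falling away to about 4.6% for digit 9, and the nine shares add up to 100%.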


And to make things more visible, here are the Numberphile video durations (in seconds) distributed by leading digit:



Not very Benford-esque are they? More on this later.

Here are the viewing counts (again, remember this only relates to the distribution of leading digits):



Closer. The 1s are certainly dominant.

Now here are the viewer comments:



Not quite right, is it?

But perhaps the sample size is too small, with just over 100 Numberphile videos.

So next I sent Daniel the stats for ALL MY CHANNELS:


And again, you'll see below that the video durations do not give a Benford curve:


But take a look at the graph we get from the leading digits on view counts:


Much better. And the same applies to the number of comments on videos:


So Benford's Law seems to be holding true.

I guess the big question is, why does it not apply to video durations?

I'd love to hear people's explanations.

Here's the one Daniel put forward, which makes sense to me.

Essentially, the durations are "planned" (by me). Videos that are too long may be unpalatable to viewers.

For example, few films will be longer than 10 minutes - a psychological barrier because of the double digit.

(In fact, for quite some time I was not able to post videos longer than 10 minutes... and of all my videos just 5% exceed 10 minutes)

So for videos shorter than 10 minutes and using seconds as our unit of measurement, the durations (and leading digits) can be grouped like this:

Leading Digit 1 - Applies to films of duration 1 second, 10-19 secs and 1:40-3:19 (111 contributing values, 18.5%)
Leading Digit 2 - 2 seconds, 20-29 secs and 3:20-4:59 (111 values, 18.5%) 
Leading Digit 3 - 3 seconds, 30-39 secs and 5:00-6:39 (111 values, 18.5%) 
Leading Digit 4 - 4 seconds, 40-49 secs and 6:40-8:19 (111 values, 18.5%) 
Leading Digit 5 - 5 seconds, 50-59 secs and 8:20-9:59 (111 values, 18.5%) 
Leading Digit 6 - 6 secs and 1:00-1:09 (11 values, 1.84%) 
Leading Digit 7 - 7 secs and 1:10-1:19 (11 values, 1.84%) 
Leading Digit 8 - 8 secs and 1:20-1:29 (11 values, 1.84%) 
Leading Digit 9 - 9 secs and 1:30-1:39 (11 values, 1.84%)
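These counts are easy to check with a few lines of Python. This sketch assumes, as the table does, that every whole-second duration from 1 second to 9:59 (599 seconds) counts once; the constant name is mine.

```python
from collections import Counter

MAX_SECONDS = 599  # 9:59, the longest duration under 10 minutes

# Count how many whole-second durations start with each digit.
counts = Counter(int(str(s)[0]) for s in range(1, MAX_SECONDS + 1))
shares = {d: counts[d] / MAX_SECONDS for d in range(1, 10)}

for d in range(1, 10):
    print(f"Leading digit {d}: {counts[d]} values, {shares[d]:.2%}")
```

Digits 1-5 each get 111 of the 599 possible values and digits 6-9 get 11 each, matching the table above.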

So you can see most of the films (when only including those under 10 minutes) will fall into the 1-5 groupings.

And the duration graphs from my videos back this up, with a stronger distribution in the 1-5 slots, and then dropping off in the 6-9 slots.



26 comments:

  1. I agree with Daniel. As YouTube didn't let you upload videos longer than 10 minutes, the duration of the videos ranged from 1 to 10 minutes, which is from 60 to 600 seconds. I'm not entirely sure, but most of your videos run from 4 to 6 minutes (240 to 360 seconds, which makes sense when you look at the graph). Fewer of them are 7 to 10 minutes long (420 to 600 seconds), and very rarely you make a shorter video, lasting 70 to 90 seconds.
    The length of a video is not random, it has much more to do with people's attention span (a longer video would be boring for some people, and some others would not have time to watch it to the end) and other factors, like how much information you have or how complicated it is to explain what you want to show.

    Replies
    1. Try out a set of random numbers with Benford's law... Surprise, it doesn't work! You get a flat distribution because the probability of landing on any number is the same! It only works on natural sets of numbers. So if the durations were random, they would fit the distribution expected under Benford's law even worse.

    2. I couldn't agree more, Miguel!

  2. I think the problem with the duration is that the sample isn't big enough, am I right?

  3. Yeah, duration is a bit problematic just because of the fact you pointed out. However it could work on movies where 1:40 minute intervals are not so big. Like you said, the sampling must be large enough, but so has to be the base you're taking that.

    Replies
    1. Even on movies, most movies range from 1-3 hours, which is 60-180 minutes or 3600 - 10800 seconds. That's still probably not a wide enough distribution to see a clear Benford curve; relatively few movies are < 2000 seconds long or > 10000.

  4. It's not just the sample size, it's the width of the distribution. If you assume Brady's videos range from ~1 to 10 minutes, that's 60 - 600 seconds, just one order of magnitude. For a good fit to a Benford curve you typically need at least 3 orders of magnitude, which would require videos greater than 1.5 hrs long.

  5. So Brady... Are you going to make more videos ranging from 1:40 minutes (100 seconds) to 5:00 minutes (300 seconds) long so that the duration curve looks more like Benford's Law?

  6. Yeah, I think that is the reason too, but I would also add something about the distribution across leading digits 1-5: as the leading digit increases, so does the duration range that contributes the most. For 1 the most important contribution comes from 1:40 to 3:19, for 2 from 3:20 to 4:59, etc. I would guess that one of those intervals contains the average duration of your videos, around which durations are most likely to be found. For example, if the average duration were 6:01, the most common leading digit would be 3. This would also explain why the graph of total video durations doesn't really follow the equally probable distribution of 18.5% for leading digits 1-5.

  8. I wonder if converting the duration into a smaller unit, say time in milliseconds would have any effect on the distribution of leading digits. I have a feeling it might lead to a more Benford-like curve

    Replies
    1. Time in milliseconds wouldn't change the leading digit though. Eg. 1 second=1000ms, 2=2000 etc.

    2. True enough. I didn't completely think through that train of thought.

  9. A thought; what if you combine the samples? Everything's converted to pure numbers so it doesn't matter about mixing the data, and you'd be sampling over a greater range of magnitudes; you'd also be effectively tripling your sample size without having to find more samples!

  10. Your video lengths don't exactly span several orders of magnitude. Your videos don't go from tens to hundreds to thousands of seconds. You mostly stay in the hundreds of seconds. Spanning several orders of magnitude was one of the requirements for the law to work.

  11. I believe that if the durations of the videos were converted into frames, the curve might look better.

  12. I agree with the hypotheses that, due to several causes, the range of durations is limited, especially in magnitude.
    My hypothesis is that using a unit that spans more orders of magnitude over this range may help, for instance Planck units. Any likelihood of you posting the raw data to let us try to confirm?

  13. What does the curve look like if you combine the Duration, View, and Comment counts into one? As I understood the original video, taking multiple, even wildly different, measurements should end up in the curve?

    Replies
    1. are you talking about the average of the three, or the total?

  15. 1) Have you tried taking the duration in milliseconds?
    2)I think one reason the Benford's Law is not valid here is because the time is measured in a mixed base (minutes divide by 60, seconds divide by any power of 10). I propose "cheating" the way we measure time...it's a complicated thing but...First let's convert all the durations into seconds, then let's consider 1 minute= 100 seconds, then do the statistics...maybe the obtained curve gets more Benfordesque :)

  19. I see an inherent Bias in the digits of the number system. That is my first instinct because that is a bias from 10's to a 60's as the number system. But it would be evened out in some double bind relationship, when you convert minutes into seconds.

    So is there an update on this post about this discussion?
