Linux + Administrator = Linuxator

Linux Administrator’s blog

5 things you didn’t know about linux kernel code metrics

Posted by Maciej Sołtysiak on July 22, 2008

Recently Greg Kroah-Hartman showed some very interesting Linux kernel development stats. I decided to do some of my own, and the result is 5 cool things you probably didn’t know about the kernel code 😉

I haven’t seen any of these published about the kernel so far.

Greg’s stats

First, let’s quickly summarize Greg’s findings related to kernel size (he also did a lot of work on who’s contributing, which I’ll skip here). The daily average (based on data from the 2007-2008 period) is:

  • 4300 lines added, 1800 lines removed, 1500 lines modified
  • 3.69 changes per hour.

Greg also says that the kernel is growing 10% each year and currently stands at 9.2 million lines, of which drivers are the biggest part (55%). The core kernel is about 5%.
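
Greg's figures can be cross-checked against each other with a little arithmetic. The following is a minimal Python sketch, assuming that net growth is simply lines added minus lines removed (modified lines are treated as size-neutral) and that a year has 365 days; the variable names are only illustrative.

```python
# Daily averages quoted from Greg's 2007-2008 stats
LINES_ADDED_PER_DAY = 4300
LINES_REMOVED_PER_DAY = 1800
CHANGES_PER_HOUR = 3.69
TOTAL_LINES = 9_200_000  # all lines, including comments and blanks

# Assumption: modified lines do not change the size of the tree
net_lines_per_day = LINES_ADDED_PER_DAY - LINES_REMOVED_PER_DAY
net_lines_per_year = net_lines_per_day * 365
yearly_growth_percent = net_lines_per_year / TOTAL_LINES * 100
changes_per_day = CHANGES_PER_HOUR * 24

print(f"net lines per day:  {net_lines_per_day}")
print(f"net lines per year: {net_lines_per_year}")
print(f"yearly growth:      {yearly_growth_percent:.1f}%")
print(f"changes per day:    {changes_per_day:.2f}")
```

Under those assumptions the net growth comes out just under 10% a year, which agrees with the growth rate Greg quotes.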

Those numbers seemed a little odd to me, especially the 9 million lines, so I wanted to check for myself. It turned out that Greg wasn’t counting pure source lines of code (SLOC) but all lines: he didn’t exclude comments and blank lines. That is why my metrics differ from his. Funnily enough, the Wikipedia article on SLOC gives 5.2 million for the 2.6.0 kernel, which also seems incorrect.

My stats

I started by writing a small script that:

  1. downloads a 2.6.0 kernel and analyzes it using SLOCCount, written by David Wheeler,
  2. patches the tree up to the next kernel release and analyzes it using the same tool,
  3. repeats step 2 until the patches run out at 2.6.26.
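
The three steps above can be sketched as an ordered plan of downloads, patches and counts. This is a minimal Python sketch and not the original script: the kernel.org URL layout, the file names and the helper names (`versions`, `plan`) are assumptions, and the sketch only lists the steps without downloading anything or running SLOCCount.

```python
# Assumed mirror layout for the 2.6 series; adjust to the mirror you use
BASE_URL = "https://www.kernel.org/pub/linux/kernel/v2.6"


def versions(last_minor=26):
    """Return the release names 2.6.0 .. 2.6.<last_minor> in order."""
    return [f"2.6.{minor}" for minor in range(last_minor + 1)]


def plan(last_minor=26):
    """Return the ordered (action, argument) steps of the walk."""
    releases = versions(last_minor)
    first = releases[0]
    # Step 1: fetch the full 2.6.0 tarball and count it
    steps = [
        ("download", f"{BASE_URL}/linux-{first}.tar.bz2"),
        ("count", first),
    ]
    # Steps 2 and 3: fetch each incremental patch, apply it, count again.
    # Assumption: patch-2.6.N.bz2 applies on top of a 2.6.(N-1) tree.
    for release in releases[1:]:
        steps.append(("download", f"{BASE_URL}/patch-{release}.bz2"))
        steps.append(("patch", release))
        steps.append(("count", release))
    return steps


if __name__ == "__main__":
    for action, argument in plan():
        print(action, argument)
```

Each "count" step would run the counting tool on the patched tree and store the result for that release, which is what makes the per-release charts below possible.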

Just in case, I also used a different tool called CLOC to analyze the same code. A note on the tools used is at the end of this post.

This script ate 477MB of disk space with tarballs and bzipped patches.

1. The kernel has just reached 6 million lines with 2.6.26!

[Figure: Linux kernel lines of code]

Yes, indeed, with 2.6.26 we’ve reached over 6 million lines of code, as the chart shows.

Both SLOCCount and CLOC show similar results. What is interesting here is that:

  • there are over a million blank lines,
  • and a million lines of comments (which are of course important too),
  • the code shows faster-than-linear growth,
  • if the current pace is maintained, I predict we might reach 7 million with 2.6.30 and 8 million with 2.6.33; just look at the forecast.

[Figure: Linux kernel lines of code forecast]
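
To see what that forecast implies per release, here is a minimal Python sketch. It assumes the CLOC code total for 2.6.26 from the table further down as the starting point and spreads the growth evenly over the releases in between; the helper name `sloc_per_release` is only illustrative.

```python
SLOC_2_6_26 = 6_062_943  # CLOC "code" total for 2.6.26 reported in this post


def sloc_per_release(start_sloc, target_sloc, releases):
    """Average SLOC that must be added per release to hit the target."""
    return (target_sloc - start_sloc) / releases


# 2.6.26 -> 2.6.30 is four releases, 2.6.30 -> 2.6.33 is three
to_seven_million = sloc_per_release(SLOC_2_6_26, 7_000_000, 30 - 26)
to_eight_million = sloc_per_release(7_000_000, 8_000_000, 33 - 30)

print(f"needed per release up to 2.6.30: {to_seven_million:.0f}")
print(f"needed per release up to 2.6.33: {to_eight_million:.0f}")
```

The second figure is larger than the first, so the forecast itself assumes that the growth keeps accelerating, in line with the faster-than-linear trend noted above.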

2. It takes about 83 days (2¾ months) for a new kernel release!

As Greg Kroah-Hartman says, the current release scheme is solid and we’re getting an average of around 80-83 days between releases. That stability set in around 2006; the first 2.6 releases were more frequent and buggier. Here’s a graph and a table showing the numbers for the stable release cycle.

[Figure: Days between releases of Linux kernel]

3. Number of files in the project continues to grow faster than linear

This means that not only does the existing code grow, but lots of new things come along as well. And this is true. Think of virtualization infrastructure, wireless, new architectures (e.g. OLPC was merged recently).

[Figure: Files in Linux kernel source]

First, look at the sheer number of files and how much they weigh in MB. In the chart, the blue line represents all files in the source tree, and the green line shows the number of files that were analyzed by SLOCCount and CLOC. Not all files are analyzed because not all of them contain code. Either way, this gives an idea of how many files people put into the source tree.

[Figure: Size of Linux kernel source code]

Size is growing very rapidly too. Recent kernels grow by 6.3 MB on average. The record holder is 2.6.25, which gained a whopping 13 MB. With an 83-day release cycle, the 6.3 MB average means the tree was gaining around 80 kB per day! (It’s not just code; documentation adds to the numbers.)
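
The per-day figure can be checked from the numbers above. This is a minimal Python sketch, assuming 1 MB = 1024 kB and the 83-day cycle from section 2; the helper name `kb_per_day` is only illustrative.

```python
RELEASE_CYCLE_DAYS = 83  # average days between releases, from section 2


def kb_per_day(growth_mb, days=RELEASE_CYCLE_DAYS):
    """Convert per-release growth in MB into average kB per day."""
    return growth_mb * 1024 / days


average_growth = kb_per_day(6.3)   # average recent release
record_growth = kb_per_day(13)     # the record 13 MB release

print(f"average release: {average_growth:.1f} kB/day")
print(f"record release:  {record_growth:.1f} kB/day")
```

The average release lands a little under 80 kB per day, while the record 13 MB release works out to roughly twice that.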

[Figure: Top 8 directories in the kernel source]

It is quite instructive to look at exactly what’s growing per directory in terms of SLOC. If you look at the top 8 directories within the kernel, you can notice that:

  • drivers (/drivers) are a huge part and grow very quickly
  • arch (/arch) started growing around 2.6.5
  • network (/net) started growing around 2.6.13
  • filesystems (/fs) do not grow that much, but they have their bursts, as with 2.6.16 and 2.6.19, where bulks of code were merged
  • network (/net), which started growing at 2.6.16, has now outgrown sound (/sound)

I also made a graph of the LOC in the remaining (bottom) directories. Personally I don’t find it particularly interesting, but here goes:

[Figure: The rest of the directories]

4. Daily SLOC added are over 1000 and this metric is also growing

[Figure: LOC/day growth between versions]

The daily growth of SLOC varies from release to release, of course. There are quite big differences between versions, but the trend is clearly upward: both the lower and upper bounds are higher with each new kernel.

Not incredibly interesting, but still, a metric is a metric, and you can compare it with other projects and your own programs 😉
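
For comparing with your own projects, the metric is just the SLOC difference between two releases divided by the days between them. Here is a minimal Python sketch; the function name `sloc_per_day` is an assumption, and the inputs in the usage example are hypothetical round numbers, not measurements from this post.

```python
from datetime import date


def sloc_per_day(sloc_before, sloc_after, released_before, released_after):
    """Average SLOC added per day between two releases."""
    days = (released_after - released_before).days
    if days <= 0:
        raise ValueError("releases must be given in chronological order")
    return (sloc_after - sloc_before) / days


# Hypothetical example: 100,000 SLOC added over an 80-day cycle
example = sloc_per_day(
    5_000_000, 5_100_000, date(2008, 1, 1), date(2008, 3, 21)
)
print(f"{example:.0f} SLOC per day")
```

Feeding it the per-release counts and release dates of any project gives a series that can be plotted the same way as the chart above.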

5. Language breakdown of 2.6.26 using CLOC

Two different reports, one from each tool.
== CLOC ==
-----------------------------------------------------
Language          files     blank   comment      code
-----------------------------------------------------
C                 10195    921822    976772   4709722
C/C++ Header       9400    203125    321830   1096551
Assembler          1005     36250     42921    225549
make               1005      4569      5350     15238
Perl                 19      1157      1256      6092
yacc                  5       437       318      2919
Bourne Shell         48       404      1205      2623
C++                   1       205        58      1496
lex                   5       225       248      1395
HTML                  2        58         0       367
NAnt scripts          1        83         0       290
Lisp                  1        63         0       218
Python                2        41        37       208
ASP                   1        33         0       136
awk                   2        14         7        98
Bourne Again Shell    2         7        17        34
XSLT                  1         0         1         7
-----------------------------------------------------
SUM:              21695   1168493   1350020   6062943
-----------------------------------------------------

== SLOCCount ==
ansic:		5780304	(96.08%)
asm:		218132	(3.63%)
perl:		6075	(0.10%)
cpp:		3242	(0.05%)
yacc:		2919	(0.05%)
sh:		2609	(0.04%)
lex:		1825	(0.03%)
python:		331	(0.01%)
lisp:		218	(0.00%)
pascal:		116	(0.00%)
awk:		96	(0.00%)

A note on the tools used

SLOCCount is very fast; CLOC is very slow (it spent over 10 hours crunching). The SLOC results are similar: the difference between the two tools is around 1%, so it’s negligible. The output was collected into a CSV file and plotted with JpGraph. Why JpGraph? Because I wanted to try it out, that’s all 🙂
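
The difference between the two tools can be recomputed from the two 2.6.26 reports above. This is a minimal Python sketch that only uses the numbers printed in this post; the variable names are illustrative.

```python
# Per-language counts for 2.6.26 as reported by SLOCCount in this post
SLOCCOUNT = {
    "ansic": 5_780_304, "asm": 218_132, "perl": 6_075, "cpp": 3_242,
    "yacc": 2_919, "sh": 2_609, "lex": 1_825, "python": 331,
    "lisp": 218, "pascal": 116, "awk": 96,
}
# SUM row of the CLOC report in this post
CLOC_BLANK, CLOC_COMMENT, CLOC_CODE = 1_168_493, 1_350_020, 6_062_943

sloccount_total = sum(SLOCCOUNT.values())
# Relative gap between the tools, measured against the SLOCCount total
tool_gap_percent = (CLOC_CODE - sloccount_total) / sloccount_total * 100
# Every line CLOC looked at: blank + comment + code
all_lines = CLOC_BLANK + CLOC_COMMENT + CLOC_CODE
code_share_percent = CLOC_CODE / all_lines * 100

print(f"SLOCCount total:      {sloccount_total}")
print(f"gap between tools:    {tool_gap_percent:.2f}%")
print(f"all lines (CLOC):     {all_lines}")
print(f"code share of lines:  {code_share_percent:.1f}%")
```

The all-lines total is also a reminder of why a count that includes blanks and comments ends up so much higher than a pure SLOC count.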

Cheers!

Links

  1. Linux Kernel Development Stats from Greg Kroah-Hartman
  2. Greg Kroah-Hartman on the Linux Kernel
  3. David Wheeler’s SLOCCount
  4. CLOC – Count Lines of Code
  5. JpGraph – Graph creating library for PHP

18 Responses to “5 things you didn’t know about linux kernel code metrics”

  1. Tuxo said

    Hi, interesting information, although the 3_files.png (File in Linux kernel source) plot is not faster than linear. I think it has two linear segments with a regime change between the 2.6.7 and 2.6.8 kernel versions. I am not familiar with kernel history, so I cannot tell what the cause is: it could be new features, new drivers, new developers, a mix of any of these, who knows, SCO secret code (just kidding).

    For the other plots it is clear that the behavior is faster than linear, maybe quadratic.

    SALUDOS
    TUXO

  2. Ken Ryan said

    (Linked here from Groklaw.net)

    Interesting stats.

    It’s not clear to me that things like Kconfig files are included. They are part of the build structure, but are not a standard language. They are critical parts of the make system however.

    Also I might quibble with the concept of removing documentation files from the stats. IMHO documentation is as important as the code itself. There have been a few people putting special effort into improving kernel documentation, which leads to increased robustness and ease of adding *more* code.

    Anyway, I enjoyed your article. Thanks!

  3. cheve said

    I am surprised that only 1 C++ language file is detected by the SLOC.

  4. Rhialto said

    While your article doesn’t really say anything on the matter, I’d like to add that “more lines of code” is not necessarily a good thing. It can be more features, but it can also be more bloat. A code shrink from time to time (when doing a rewrite in a more efficient way) can be good too.

  5. Hi,

    some feedback:
    – now that you mention it, the number of files does look more like 2 linear segments.
    – 1 C++ file is reported by CLOC. SLOCCount reports a different structure. I’ll add it to the article:

    ansic: 5780304 (96.08%)
    asm: 218132 (3.63%)
    perl: 6075 (0.10%)
    cpp: 3242 (0.05%)
    yacc: 2919 (0.05%)
    sh: 2609 (0.04%)
    lex: 1825 (0.03%)
    python: 331 (0.01%)
    lisp: 218 (0.00%)
    pascal: 116 (0.00%)
    awk: 96 (0.00%)

    – these stats do include the Documentation directory, e.g. in the 3_bottom.png file.
    – it’s true that more lines of code does not mean better quality; I think we should have Coverity analysis for that. One thing that these stats hide is the real rewrites. They do happen. Overall stats say about 4300 lines added, 1800 lines removed, 1500 lines modified. See? 1800 lines removed daily!

    Thanks for the insightful pointers.

  6. BobJ said

    The usage of included files, macros, conditional code (ex: 32-bit vs. 64-bit), comments!, readability white space, etc. all means that lines of code as a metric is only a count of text lines and has nothing to do with the quality or quantity of code. Even counting actual number of bytes produced is suspect, as some well written sub-code (subroutines/functions) may be replaced with a combination of macros/declarative constants that at the least save the resources to call/return from sub-code and for some higher level-languages with inter-statement/intra-statement optimization that take a lot fewer resources.

    I know of a project where the usage of macros (the code executed much faster due to inter-statement optimization of the macro code, which was deliberately written to enhance this optimization) was effectively penalized, so a very astute programmer had the macros generate source files, which upped the group’s source code count to a suitably impressive level.

  7. Yup, SLOC is a very rough statistic and doesn’t say much about quality. Quality is something that could be measured by projects like Coverity, i.e. static analysis of source code. SLOCCount, however, is wise enough to skip blank lines and comments. CLOC does the same. SLOC as a metric mainly acts as a basis for calculating the cost of developer effort, e.g. using the COCOMO model. I didn’t want to do this analysis, because I am not knowledgeable enough to tweak the basic coefficients as well as David Wheeler did in his analysis.

    Going back to the metrics. I’d say that they show you how much code is being written and included in the kernel. They surely say something about quantity, because includes and macros are written and placed in one file and reused all over the project. The same goes for code that gets copied and modified only slightly. It is still quantity. I see your point though. Maybe it’d be better to run SLOC on the output of gcc -E (so the macros would be expanded), but seriously, if you have a multiline macro like:

    #include <stdio.h>

    /* Close f if it is open and reset it to NULL; otherwise warn. */
    #define f_close(f) \
    do { \
        if ((f) != NULL) \
        { \
            fclose((f)); \
            (f) = NULL; \
        } \
        else \
        { \
            printf("f_close: closing closed file?\n"); \
        } \
    } while (0)

    and then when you reuse it just by typing:

    f_close(fp);

    in reality you’re adding 11 lines or so, but your effort is this one line.

    To recap, these stats are more biased towards effort to write the code.

  11. amit kumar said

    I want to study the Linux kernel code and install my own. There are various links on the internet, but I am not finding any good tutorials. Please help.

    • Hi Amit,

      To study the code, just download a stable version from http://kernel.org and look around sources and documentation. I also recommend getting a tutorial on writing kernel modules. They are easy to start with.

      To install, you should find a tutorial on installing a “vanilla” kernel for your distribution.
      Basically, the process is to get the kernel source from kernel.org or git. Then use your current .config file with make oldconfig in the kernel source to replicate your current settings. After that it’s convenient to use make menuconfig to change options. You can make modifications or write your own code. Compile and install the kernel (how you do it varies between systems, so get a tutorial) and hope it works 🙂

      Good luck!
