5 things you didn’t know about linux kernel code metrics
Posted by Maciej Sołtysiak on July 22, 2008
Recently Greg Kroah-Hartman showed some very interesting Linux kernel development stats. I decided to do some too, and the result is 5 cool things you probably didn’t know about the kernel code.
I haven’t seen any of these published elsewhere.
Greg’s stats
First, let’s quickly summarize Greg’s findings related to kernel size (he also did a lot of work on who’s contributing, which I’ll skip here). The daily average (based on data from the 2007-2008 period) is:
- 4300 lines added, 1800 lines removed, 1500 lines modified
- 3.69 changes per hour.
Greg also says that the kernel is growing 10% each year. It currently stands at 9.2 million lines, of which the drivers are the biggest part (55%). The core kernel is about 5%.
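As a quick sanity check, the daily figures can be set against the 10% yearly growth claim. This is a minimal sketch using only the numbers quoted from Greg; the variable names are mine, and it assumes that "modified" lines do not change the total line count.

```python
# Figures quoted from Greg's stats (2007-2008 daily averages).
LINES_ADDED_PER_DAY = 4300
LINES_REMOVED_PER_DAY = 1800
CHANGES_PER_HOUR = 3.69
TOTAL_LINES = 9_200_000

# Assumption: modified lines replace existing ones, so the total stays the same.
net_lines_per_day = LINES_ADDED_PER_DAY - LINES_REMOVED_PER_DAY
net_lines_per_year = net_lines_per_day * 365
yearly_growth = net_lines_per_year / TOTAL_LINES
changes_per_day = CHANGES_PER_HOUR * 24

print(f"net growth: {net_lines_per_day} lines/day, {net_lines_per_year} lines/year")
print(f"yearly growth: {yearly_growth:.1%} of {TOTAL_LINES} lines")
print(f"changes per day: {changes_per_day:.1f}")
```

Under that assumption the net daily growth works out to just under 10% of 9.2 million lines per year, so the two figures agree with each other.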
Those numbers seemed a little bit odd to me, especially the 9 million lines, so I wanted to check them myself. What I found out was that Greg wasn’t counting pure source lines of code (SLOC) but all lines: he didn’t exclude comments and blank lines. That is why my metrics differ from his. It’s funny that the Wikipedia article on SLOC gives 5.2 million for the 2.6.0 kernel, which also seems incorrect.
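To make the distinction between "all lines" and SLOC concrete, here is a deliberately naive sketch that sorts the lines of C-like source into blank, comment and code. It is not how SLOCCount or CLOC work internally; the function name is hypothetical, and it ignores corner cases such as comment markers inside string literals.

```python
def classify_lines(source: str) -> dict:
    """Count blank, comment and code lines in C-like source (naive)."""
    counts = {"blank": 0, "comment": 0, "code": 0}
    in_block = False  # True while inside a /* ... */ comment
    for raw in source.splitlines():
        line = raw.strip()
        if not line and not in_block:
            counts["blank"] += 1
            continue
        has_code = False
        i = 0
        while i < len(line):
            if in_block:
                end = line.find("*/", i)
                if end == -1:
                    i = len(line)      # comment continues on the next line
                else:
                    in_block = False
                    i = end + 2
            elif line.startswith("/*", i):
                in_block = True
                i += 2
            elif line.startswith("//", i):
                break                  # rest of the line is a comment
            else:
                if not line[i].isspace():
                    has_code = True
                i += 1
        counts["code" if has_code else "comment"] += 1
    return counts


sample = """/* header
 * comment */
#include <stdio.h>

int main(void) { // entry
    return 0; /* done */
}
"""
print(classify_lines(sample))
```

Only the "code" count corresponds to SLOC; adding all three counts gives the kind of total that Greg reported.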
My stats
I started by writing a small script that:
- downloads a 2.6.0 kernel and analyzes it using SLOCCount, written by David Wheeler,
- patches it to the next kernel version and analyzes it using the same tool,
- repeats step 2 until the patches run out at 2.6.26.
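The steps above can be sketched as a plan generator. This is not my original script, just an illustration of the loop: the function name is hypothetical, the file names assume the usual linux-2.6.N.tar.bz2 and patch-2.6.N.bz2 naming, and actually downloading, patching and running the counting tool is left out.

```python
def build_plan(first: int = 0, last: int = 26, series: str = "2.6") -> list:
    """Return the ordered (action, target) steps for walking the kernel series."""
    # Step 1: start from the full tarball of the first release and count it.
    steps = [
        ("download", f"linux-{series}.{first}.tar.bz2"),  # assumed file name
        ("count", f"{series}.{first}"),
    ]
    # Steps 2 and 3: apply each incremental patch, then count again.
    for minor in range(first + 1, last + 1):
        steps.append(("patch", f"patch-{series}.{minor}.bz2"))  # assumed file name
        steps.append(("count", f"{series}.{minor}"))
    return steps


plan = build_plan()
for action, target in plan[:4]:
    print(action, target)
print("...", len(plan), "steps in total")
```

Patching forward instead of downloading every full tarball keeps the amount of data to fetch small, since only the first release is needed in full.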
Just in case, I also used a different tool called cloc to analyze the same code. A word on the tools used is at the end of this post.
This script ate 477MB of disk space with tarballs and bzipped patches.
1. The kernel has just reached 6 million lines with 2.6.26!
Yes, indeed, with 2.6.26 we’ve reached over 6 million lines of code. You can see that on the chart.
Both SLOCCount and CLOC show similar results. What is interesting here is that:
- there are over a million blank lines,
- and over a million lines of comments (which are of course important too),
- the code shows faster-than-linear growth,
- if the current pace is maintained, I predict we might get 7 million with 2.6.30 and 8 million with 2.6.33, just look at the forecast.
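The forecast itself implies accelerating growth, which a little arithmetic shows. This is a rough sketch: the milestones are the rounded figures from the prediction, the 83-day cycle is the average discussed in point 2, and the helper names are mine.

```python
RELEASE_CYCLE_DAYS = 83  # average days between releases (see point 2)

# (release, forecast SLOC) pairs taken from the prediction, rounded to millions.
milestones = [("2.6.26", 6_000_000), ("2.6.30", 7_000_000), ("2.6.33", 8_000_000)]


def minor(release: str) -> int:
    """Return the last number of a 2.6.x version string, e.g. 26 for '2.6.26'."""
    return int(release.split(".")[-1])


def implied_growth(start, end):
    """SLOC per release and per day needed to get from one milestone to the next."""
    releases = minor(end[0]) - minor(start[0])
    per_release = (end[1] - start[1]) / releases
    return per_release, per_release / RELEASE_CYCLE_DAYS


for start, end in zip(milestones, milestones[1:]):
    per_release, per_day = implied_growth(start, end)
    print(f"{start[0]} -> {end[0]}: {per_release:,.0f} SLOC/release, {per_day:,.0f} SLOC/day")
```

The second million is expected in three releases instead of four, so the per-release growth has to rise by about a third between the two intervals.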
2. It takes about 83 days (2¾ months) for a new kernel release!
As Greg Kroah-Hartman says, the current release scheme is solid and we’re getting an average of around 80-83 days between releases. This stability set in around 2006; the first 2.6 releases were more frequent and buggier. Here’s a graph and a table showing the numbers for the stable release cycle.
3. Number of files in the project continues to grow faster than linear
This means that not only does the existing code grow, but lots of new things come along. And this is true. Think of virtualization infrastructure, wireless, new architectures (e.g. OLPC was merged recently).
First, look at the sheer number of files and how much they weigh in MB. To the right, the blue line represents all files in the directory and the green line shows the number of files that were analyzed by SLOCCount and CLOC. Not all files are analyzed because not all contain code. Anyway, this gives an idea of how many files people put in the source tree.
Size is growing very rapidly too. Recent kernels grow by 6.3 MB on average. The record holder is 2.6.25, which gained a whopping 13 MB. With the 83-day release cycle, the 6.3 MB average means the kernel was gaining around 80 kB per day! (It’s not just code, documentation adds to the numbers too.)
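The per-day figure can be reproduced from the numbers in this paragraph. A rough sketch, assuming 1 MB = 1000 kB and the 83-day release cycle; the helper name is mine.

```python
RELEASE_CYCLE_DAYS = 83  # average days between releases


def kb_per_day(growth_mb: float, days: int = RELEASE_CYCLE_DAYS) -> float:
    """Average daily growth in kB, taking 1 MB = 1000 kB."""
    return growth_mb * 1000 / days


print(f"average release (6.3 MB): {kb_per_day(6.3):.0f} kB/day")
print(f"record release (13 MB):   {kb_per_day(13):.0f} kB/day")
```

The 6.3 MB average lands close to the quoted 80 kB per day, while the 13 MB record release would be roughly twice that.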
It is quite instructive to look at exactly what’s growing per directory in terms of SLOC. If you take a look at the top 8 directories within the kernel, you can notice that:
- drivers (/drivers) are a huge part and grow very quickly
- arch (/arch) started growing around 2.6.5
- network (/net) started growing around 2.6.13
- filesystems (/fs) do not grow that much but they have their bursts, like with 2.6.16 and 2.6.19, where bulks of code were merged
- network (/net), which started growing at 2.6.16, has now outgrown sound (/sound)
I also did a graph of the bottom directories by LOC. Personally I don’t find it particularly interesting, but here goes:
4. Daily SLOC added are over 1000 and this metric is also growing
The daily growth of SLOC for given releases varies, of course. There are quite big differences between versions; however, what can certainly be stated is that the trend is upward. Both the lower and upper bounds are at higher values with each new kernel.
Not incredibly interesting, but still, a metric is a metric and you can compare it with other projects and your own programs.
5. Language breakdown of 2.6.26 using CLOC
2 different reports.

== CLOC ==
-------------------------------------------------------------
Language              files      blank    comment       code
-------------------------------------------------------------
C                     10195     921822     976772    4709722
C/C++ Header           9400     203125     321830    1096551
Assembler              1005      36250      42921     225549
make                   1005       4569       5350      15238
Perl                     19       1157       1256       6092
yacc                      5        437        318       2919
Bourne Shell             48        404       1205       2623
C++                       1        205         58       1496
lex                       5        225        248       1395
HTML                      2         58          0        367
NAnt scripts              1         83          0        290
Lisp                      1         63          0        218
Python                    2         41         37        208
ASP                       1         33          0        136
awk                       2         14          7         98
Bourne Again Shell        2          7         17         34
XSLT                      1          0          1          7
-------------------------------------------------------------
SUM:                  21695    1168493    1350020    6062943
-------------------------------------------------------------

== SLOCCount ==
ansic: 5780304 (96.08%)
asm: 218132 (3.63%)
perl: 6075 (0.10%)
cpp: 3242 (0.05%)
yacc: 2919 (0.05%)
sh: 2609 (0.04%)
lex: 1825 (0.03%)
python: 331 (0.01%)
lisp: 218 (0.00%)
pascal: 116 (0.00%)
awk: 96 (0.00%)
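The SUM row of the CLOC report also shows how much of the tree is not code at all. A small sketch using only those three totals; the variable names are mine.

```python
# SUM row of the CLOC report for 2.6.26.
cloc_sum = {"blank": 1_168_493, "comment": 1_350_020, "code": 6_062_943}

all_lines = sum(cloc_sum.values())
shares = {kind: count / all_lines for kind, count in cloc_sum.items()}

print(f"all lines: {all_lines}")
for kind, share in shares.items():
    print(f"{kind:>8}: {share:.1%}")
```

Roughly 3 in 10 of the lines in the analyzed files are blank or comments, which is the main reason an "all lines" count comes out so much higher than SLOC.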
Word of comment on tools used
SLOCCount is very fast, CLOC is very slow (crunching took over 10 hours with CLOC). The SLOC results are similar: the difference between them is around 1%, so it’s negligible. The output was processed into a CSV file and plotted with JpGraph. Why JpGraph? Because I wanted to try it out, that’s all.
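The "around 1%" difference can be checked directly from the two 2.6.26 reports in this post. A minimal sketch; the variable names are mine and the numbers are copied from the reports.

```python
# Per-language SLOC from the SLOCCount report for 2.6.26.
sloccount = {
    "ansic": 5_780_304, "asm": 218_132, "perl": 6_075, "cpp": 3_242,
    "yacc": 2_919, "sh": 2_609, "lex": 1_825, "python": 331,
    "lisp": 218, "pascal": 116, "awk": 96,
}
CLOC_TOTAL = 6_062_943  # "code" column of CLOC's SUM row

sloccount_total = sum(sloccount.values())
relative_diff = (CLOC_TOTAL - sloccount_total) / CLOC_TOTAL

print(f"SLOCCount total: {sloccount_total}")
print(f"CLOC total:      {CLOC_TOTAL}")
print(f"difference:      {relative_diff:.2%}")
```

The two tools disagree by less than 1% overall, even though they split the code between languages somewhat differently.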
Cheers!
Links
- Linux Kernel Development Stats from Greg Kroah Hartman
- Greg Kroah Hartman on the Linux Kernel
- David Wheeler’s SLOCCount
- CLOC – Count Lines of Code
- JpGraph – Graph creating library for PHP

Tuxo said
Hi, interesting information, although the 3_files.png (Files in Linux kernel source) plot is not faster than linear. I think it has two linear segments with a regime change between the 2.6.7 and 2.6.8 kernel versions. I am not familiar with kernel history, therefore I cannot tell what the cause is. It could be new features, new drivers, new developers, a mix of any of these, who knows, SCO secret code (just kidding).
For the other plots it is clear the behavior is faster than linear, maybe quadratic.
Regards
TUXO
Ken Ryan said
(Linked here from Groklaw.net)
Interesting stats.
It’s not clear to me that things like Kconfig files are included. They are part of the build structure, but are not a standard language. They are critical parts of the make system however.
Also I might quibble with the concept of removing documentation files from the stats. IMHO documentation is as important as the code itself. There have been a few people putting special effort into improving kernel documentation, which leads to increased robustness and ease of adding *more* code.
Anyway, I enjoyed your article. Thanks!
cheve said
I am surprised that only 1 C++ language file is detected by the SLOC.
Rhialto said
While your article doesn’t really say anything on the matter, I’d like to add that “more lines of code” is not necessarily a good thing. It can be more features, but it can also be more bloat. A code shrink from time to time (when doing a rewrite in a more efficient way) can be good too.
Maciej Sołtysiak said
Hi,
some feedback:
- now that you mention it, the number of files does look more like 2 linear segments.
- 1 C++ file is reported by CLOC. SLOCCount reports a different structure. I’ll add it to the article:
ansic: 5780304 (96.08%)
asm: 218132 (3.63%)
perl: 6075 (0.10%)
cpp: 3242 (0.05%)
yacc: 2919 (0.05%)
sh: 2609 (0.04%)
lex: 1825 (0.03%)
python: 331 (0.01%)
lisp: 218 (0.00%)
pascal: 116 (0.00%)
awk: 96 (0.00%)
- these stats do include the Documentation directory, e.g. see the 3_bottom.png file.
- it’s true that more lines of code does not mean better quality. I think we should have Coverity analysis for that. One thing that these stats hide is the real rewrites. They do happen. Overall stats say about 4300 lines added, 1800 lines removed, 1500 lines modified. See? 1800 lines removed daily!
Thanks for the insightful pointers.
BobJ said
The usage of included files, macros, conditional code (ex: 32-bit vs. 64-bit), comments!, readability white space, etc. all means that lines of code as a metric is only a count of text lines and has nothing to do with the quality or quantity of code. Even counting the actual number of bytes produced is suspect, as some well written sub-code (subroutines/functions) may be replaced with a combination of macros/declarative constants that at the least save the resources to call/return from sub-code and, for some higher-level languages with inter-statement/intra-statement optimization, take a lot fewer resources.
I know of a project where the usage of macros (the code executed much faster due to inter-statement optimization of the macro code, which was deliberately written to enhance this optimization) was effectively penalized, so a very astute programmer had the macros generate source files, which upped the group’s source code count to a suitably impressive level.
Maciej Sołtysiak said
Yup, SLOC is a very rough statistic and doesn’t say much about quality. Quality is something that could be measured by projects like Coverity, i.e. static analysis of source code. SLOCCount, however, is wise enough to skip blank lines and comments. CLOC does the same. SLOC as a metric mainly acts as a basis for calculating the cost of developer effort, e.g. using the COCOMO model. I didn’t want to do this analysis, because I am not knowledgeable enough to tweak the basic coefficients as well as David Wheeler did in his analysis.
Going back to the metrics, I’d say that they show you how much code is being written and included into the kernel. They surely say something about the quantity, because includes and macros are written and placed in one file and reused all over the project. The same goes for code that gets copied and modified only slightly. It still is quantity. I see your point though. Maybe it’d be better to run SLOC on the output of gcc -E (so the macros would be expanded), but seriously, if you have a multiline macro like:
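#include <stdio.h>

/* Close f if it is open and reset it to NULL so a second call is detected. */
#define f_close(f) \
do { \
    if ((f) != NULL) \
    { \
        fclose((f)); \
        (f) = NULL; \
    } \
    else \
    { \
        printf("f_close: closing closed file?\n"); \
    } \
} while (0)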
and then you reuse it just by typing:
f_close(fp);
in reality you’re adding 11 lines or so, but your effort is this one line.
To recap, these stats are more biased towards effort to write the code.