Fwd: [Pharo-dev] Characterizing Pharo Code: A Technical Report

Stéphane Ducasse
Wed, Jan 15, 2020 11:56 AM

Begin forwarded message:

From: Oleksandr Zaytsev <olk.zaytsev@gmail.com>
Subject: [Pharo-dev] Characterizing Pharo Code: A Technical Report
Date: 15 January 2020 at 12:02:06 CET
To: pharo-dev@lists.pharo.org
Reply-To: Pharo Development List <pharo-dev@lists.pharo.org>

Hello,

We have analyzed the source code of 50 projects selected from the Pharo ecosystem and reported our findings in this document:
https://hal.inria.fr/hal-02440055v1

Perhaps you will find it interesting.

Here are some fun facts that we have discovered:
- 25% of classes have no more than 3 methods, and 50% of classes have no more than 6 methods.
- The average number of lines of code in a Pharo method is 5.8 and the median is 3, meaning that 50% of methods have no more than 3 lines.
- About a quarter of the source code consists of message sends (method names): they account for 27.3% of source code tokens and 26.3% of characters.
- At the character level, 22.5% of the source code is string literals and 19.4% is literal arrays. Together, literals take up 44% of the characters in source code but only 7.1% of the tokens.
- Positive statements are much more common than negative ones. ifTrue: is used 3 times more often than ifFalse:. Similarly, ifTrue:ifFalse: is 26 times more common than ifFalse:ifTrue:.
- After tokenizing the code of 151,717 methods, splitting identifier names by camel case, and removing non-alphabetic characters, we obtained a sequence of almost 3 million words (e.g. ... ordered collection with all command line arguments...). This sequence contains only 8,211 unique words (including misspellings such as arrray, clipped words such as arr, and nonsense words such as ddd or xdkh). Compare this to the more than 40,000 unique words used in roughly the same amount of printed English prose.
- At least 5,480 of those 8,211 unique alphabetic sequences are valid English words.
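
The word-extraction step described in the list above (split identifiers by camel case, drop non-alphabetic characters, count unique words) can be sketched roughly as follows. This is only an illustration in Python, not the tooling used for the report: the regular expression, the function names `words_in` and `vocabulary`, and the way acronyms such as XML are kept together are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical splitting rule (an assumption, not taken from the report):
#   1. a run of capitals that is followed by a capitalized word ("XML" in "XMLParser")
#   2. an optionally capitalized lowercase word ("Ordered", "line")
#   3. any remaining run of capitals
# Digits, colons, and other non-alphabetic characters never match, so they are dropped.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+")


def words_in(source):
    """Return the lowercase alphabetic words found in a piece of source code."""
    return [word.lower() for word in _WORD.findall(source)]


def vocabulary(sources):
    """Count how often each word occurs across several pieces of source code."""
    counts = Counter()
    for source in sources:
        counts.update(words_in(source))
    return counts


if __name__ == "__main__":
    methods = [
        "OrderedCollection withAll: commandLineArguments",
        "arr2 ifTrue: [ XMLParser parse: arr2 ]",
    ]
    counts = vocabulary(methods)
    print(sum(counts.values()), "words,", len(counts), "unique")
```

Applied to a whole code base, the ratio of unique words to total words from such a count is the kind of figure compared above against English prose. Note that this sketch also picks up words inside string literals and comments, which may or may not match what the report did.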

Have a nice day, and let us know what you think.
We would be happy to receive your feedback.

Oleks


Stéphane Ducasse
http://stephane.ducasse.free.fr / http://www.pharo.org
03 59 35 87 52
Assistant: Julie Jonas
FAX 03 59 57 78 50
TEL 03 59 35 86 16
S. Ducasse - Inria
40, avenue Halley,
Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
Villeneuve d'Ascq 59650
France
