Tag Clouds: See How Noisy Your Code Is

If you follow this blog then you probably know that one of my current interests is expressive design, using either Domain-Driven Design or Domain-Specific Languages. Here is a presentation about this topic.

One of the tricky things about expressive code is that it is very hard to see how noisy a code base is. In my experience, there is a clear relation between how often words from the Ubiquitous Language appear in the code and how clear that code is.

Looking for a tool to make this explicit, I decided to try Tag Clouds. I love them: they are an extremely powerful tool to describe what a website is about, and I am pretty sure they can tell us one or two things about a code base.

So I got the source code for two projects. The first one is considered by the team that worked on it to be a really noisy implementation. It takes at least two weeks for someone to understand the very basics of how the system works, and productivity in this team is extremely low. A major refactoring is expected to solve the mess.

The other one is not a perfect project, but the team thinks it is not that bad regarding noise. The project did not have a huge budget, so it had developers coming in and out of the team all the time. A newcomer could understand the basics of the system quickly and was good enough to sign up for a card by herself in a couple of days.

Both projects are about the same size and were developed using pair programming, TDD and pretty much all the other XP practices. They were developed over one year by very similar teams of senior developers, and both have complex domains.

I used Wordle, a fantastic free tool, to generate Tag Clouds for them. To get better results I removed all comments and all string literals from the code base. I also anonymized the data to protect the innocent and the guilty.
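
The clean-up step described above could be sketched roughly as follows. This is not the code used for the clouds in this post: the class and method names (SourceWordCounter, stripCommentsAndLiterals, countWords) are made up for illustration, and the regular expression is only an approximation that ignores corner cases such as text blocks in newer Java versions.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Wordle or of the original experiment.
class SourceWordCounter {

    // Literals are listed before comments so that a "//" inside a string
    // is consumed as part of the string, not mistaken for a comment.
    private static final Pattern NOISE = Pattern.compile(
        "\"(?:\\\\.|[^\"\\\\])*\""   // string literal
        + "|'(?:\\\\.|[^'\\\\])*'"   // char literal
        + "|/\\*.*?\\*/"             // block comment (may span lines)
        + "|//[^\\n]*",              // line comment
        Pattern.DOTALL);

    // Identifier-like tokens: keywords, type names, variable names.
    private static final Pattern WORD = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Replaces every comment and literal with a single space.
    static String stripCommentsAndLiterals(String source) {
        return NOISE.matcher(source).replaceAll(" ");
    }

    // Counts how often each remaining word appears (case-sensitive).
    static Map<String, Integer> countWords(String source) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = WORD.matcher(stripCommentsAndLiterals(source));
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "Car car = new Car(\"car\"); // the car\n/* Car */ car.start();";
        // Prints one word:count pair per line.
        countWords(sample).forEach((word, n) -> System.out.println(word + ":" + n));
    }
}
```

Counting is case-sensitive on purpose, so a type like Car and a variable like car stay separate entries. Whether that is desirable depends on what you want the cloud to show.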

Here is the Tag Cloud for the first project:


And the cloud for the second one:


Can you tell me what the first project was about? And the second?

It is clear that Java keywords will be the most frequent words. That is expected in such a verbose language (I am dying to try this in a Ruby project). But it is also expected that at least some of the words from the domain will be visible in the diagram.

The first project was about cars, but I don’t think anyone can tell this from the picture above. The only clue is a carId that appears on the left side. It reflects the fact that this system passes that ID around instead of references to instances, i.e. instead of:

public void doSomethingWith(Car carToDoStuffWith){
 //...
}

You have:

public void doSomethingWith(String carId){
 //...
}

The second one may be a bit clearer. The code is about managing large professional printers that print pictures. You can see printing device, picture, product, paper, TestingPrinter, photo and several other terms that are part of the Ubiquitous Language in it. Can you tell a lot about the domain just by that? Of course not, and that’s not the goal. What the diagram shows is that there is a better balance between the noise required by your platform (Java) and domain-specific code.
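
If you want a number to go with the picture, one rough way to express that balance is the share of word tokens that are Java reserved words. The sketch below is only an illustration, assuming comments and string literals have already been removed from the input; the names KeywordNoise and keywordShare are hypothetical, and the result is a hint rather than a metric to optimise for.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class KeywordNoise {

    // Java reserved words, plus the literals true, false and null.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "abstract", "assert", "boolean", "break", "byte", "case", "catch",
        "char", "class", "const", "continue", "default", "do", "double",
        "else", "enum", "extends", "final", "finally", "float", "for",
        "goto", "if", "implements", "import", "instanceof", "int",
        "interface", "long", "native", "new", "package", "private",
        "protected", "public", "return", "short", "static", "strictfp",
        "super", "switch", "synchronized", "this", "throw", "throws",
        "transient", "try", "void", "volatile", "while",
        "true", "false", "null"));

    private static final Pattern WORD = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Fraction of word tokens that are reserved words; 0.0 for empty input.
    static double keywordShare(String source) {
        int total = 0;
        int keywords = 0;
        Matcher m = WORD.matcher(source);
        while (m.find()) {
            total++;
            if (KEYWORDS.contains(m.group())) {
                keywords++;
            }
        }
        return total == 0 ? 0.0 : (double) keywords / total;
    }

    public static void main(String[] args) {
        String snippet = "public void doSomethingWith(String carId){ }";
        System.out.println(keywordShare(snippet));
    }
}
```

Note that library types such as String are not reserved words, so they count as non-keyword tokens here. Treating framework and library names as noise too would need a longer list.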

I have tried this with multiple code bases and I always got an interesting overview of how expressive they are. Like most visualisation tools it is not a scientific proof of any kind, but it gives you a hint about how good or bad your code base is.

13 Responses to “Tag Clouds: See How Noisy Your Code Is”


  1. 1 Colin Jack Apr 29th, 2009 at 5:52 pm

    Great idea.

  2. 2 Ivan Sanchez Apr 29th, 2009 at 6:17 pm

    I’ve played with a similar idea last month and put some code together to generate tag clouds only for file names in a project:

    http://github.com/s4nchez/atmosphere/tree/master

    The results were very interesting. It reflected exactly the terms we use when talking to the client, with the most important ones appearing way bigger than the rest.

    My next step now would be to link those terms to the people who worked on that part of the code and generate a “sub-cloud” to show how widely the knowledge about particular business concepts is spread.

    About your examples, it seems the language keywords are the biggest noise. Did you think about filtering them out and getting only the words people are actually creating?

    Anyway, I’m really happy knowing I’m not the only crazy trying to do this kind of stuff.

    Cheers!

  3. 3 Tiago Fernandez Apr 29th, 2009 at 6:49 pm

    Very good idea indeed. It would also be nice if we could filter out the language’s keywords, in order to have a clearer domain view… but I think Wordle does not support word filtering (yet?). Maybe a dedicated tool for programming languages would be worth a shot.

  4. 4 Ícaro Medeiros Apr 29th, 2009 at 8:50 pm

    Why not exclude reserved words from the cloud?

  5. 5 Rafael Peixoto de Azevedo Apr 29th, 2009 at 8:54 pm

    Thanks for this outstanding post!

    Great idea: highly useful, effective and simple application for word tags.

    Congratulations!

  6. 6 Alberto Souza Apr 29th, 2009 at 9:44 pm

    Great post!!!!!

  7. 7 Phillip Calçado "Shoes" Apr 29th, 2009 at 9:53 pm

    About the filtering of Java keywords:

    There are -at least- two different ways to use the cloud. The first one is what I did, to try to get the ratio between noise caused by the language and signal in the code. This is useful to show how DSLs attack noise, for example. If you remove the keywords you won’t find it.

    The other approach I see is to do what you guys said and filter out known keywords and maybe other known entities - framework classes, for example. This is extremely useful in a different way: it tells you what the ubiquitous language looks like. Using this approach you can even try to study how Bounded Contexts are used in an application and how cohesive a system is.

    I am working right now on writing something about those multiple uses. The hardest part is finding some good public code bases to explore.

    @Ivan
    I’m writing code too, maybe we should try to do something together. Still need some prototyping time alone, though, but I’m following your repo.

  8. 8 Rob Hunter Apr 30th, 2009 at 7:03 pm

    I wanted to run this myself but had trouble running the Wordle app from local input.

    How did you do it?

    I started by concatenating all the code files together[1] and pasting the results into the Wordle “Paste a bunch of words here”.

    I eventually discovered the Wordle Advanced Tools and wrote a monster command-line[2] to find the top 200 words in a codebase.

    In my sample of two:
    * A Rails codebase (just the “app” folder) — relatively domainy, relatively few “programmer” words like “def”
    * The major section of a Java codebase: almost no domain words at all in the top 50 :-(

    I believe the result is an accurate reflection of how generally expressive the code is in each codebase. The Java one makes heavy use of “String” and “get” and other machinery words.


    [1] Concatenate all Java files together

    find . -name '*.java' | xargs cat

    [2] Identify the 200 most-used words across all Java files (treating camelCase as two words)

    find . -name '*.java' | xargs cat | tr -s '[[:blank:][:cntrl:][:punct:]]' '\n' | grep '[[:alpha:]]' | sed s/'\([[:lower:]]\)\([[:upper:]]\)'/'\1 \2'/g | tr '[[:blank:]]' '\n' | sort | uniq -c | sort -n | sed s/'^\s*\([0-9]*\) \(.*\)$'/'\2:\1'/g | tail -n 200

  9. 9 Phillip Calçado "Shoes" Apr 30th, 2009 at 8:10 pm

    Rob,

    I concatenated the code in a similar fashion (aggregated using ack) and just pasted code straight from emacs. It was around 64K lines for each example and I had no trouble. I used Safari 4.0 and Java 5 (for the applet).

    I thought about using the advanced tools to mark in different colours the multiple domains -technical and business in this case- but was too lazy to actually do this myself.

  10. 10 Gabe da Silveira May 1st, 2009 at 2:32 am

    This is definitely a cool visualization, but let me play devil’s advocate:

    I think the main thing you’re measuring here (aside from language verbosity which you already mentioned) is just how well the variables and methods are named. Even then, a good architecture might have everything namespaced inside a very important module. If the module is cleanly decoupled it may not appear high in the word count at all even if it’s the most important word. Low-level methods and local variables will be favored, which (especially the latter) are the easiest part of the code to understand and have no bearing at all on whether the architecture is good.

    As far as trading in standard language keywords for DSLs, that always sounds good in theory, but what you are doing is trading standard language semantics for something domain-specific which may or may not be confusing, leaky, or non-obvious. With standard language semantics you have a huge ready-made pool of talent. For a DSL to shine the domain should naturally lend itself to a formal description, rather than shoehorning it in for the sake of aesthetics.

    When I think of what goes into a good architecture (neither over-engineered nor under-engineered, well-tested, de-coupled, maintainable and performant), I have a hard time seeing where optimizing the word counts would not push out some more important design criterion. I think it’s a good exercise to create these word (they’re not tag) clouds and make observations, but the moment we start using it as a metric then people will optimize for it, which I don’t think would be a win per se.

  11. 11 Henrique May 4th, 2009 at 3:20 am

    Very interesting idea, thank you!

  12. 12 Jonathan Feinberg Jun 7th, 2009 at 11:13 am

    I came for the post about Clojure in GAE, and stumbled upon this one. I thought you might be amused to see the Wordle that results from concatenating the Java source for the core Wordle layout code (as distinct from the Wordle applet, which is 95% UI code). Notice, and ponder, the prevalence of boilerplate copyright blocks (“IBM”, “Corp”, “Copyright”).

    http://www.wordle.net/gallery/wrdl/921427/Wordle_Core_Source_Code

    Also, if anyone is still interested, there’s a command-line version of Wordle available here:

    http://www.alphaworks.ibm.com/tech/wordcloud

  1. 1 Fun with Wordle » Code Musings and Such Pingback on Apr 29th, 2009 at 11:20 pm


Creative Commons License

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.