Saturday, October 23, 2010

Ant DirectoryScanner memory problems with large directory trees

I encountered a very interesting build problem this week: Hudson was running out of memory when trying to parse test reports. It wasn't actually even getting to the parsing part, though. The thread dump looked innocent (Ant...hmm...), so I took a heap dump and examined that a bit. Top memory use? Ant's DirectoryScanner. It was running out of memory just trying to round up the test report XMLs. Even a 1.3GB max heap was no longer sufficient to scan our workspace. Peculiar.

From looking at the heap dump, DirectoryScanner holds on to unincluded directories. This doesn't seem right to me... I may dig further when I have time. Simple searches turned up some archived mailing list discussions, but I don't think they're related. I observed this behavior with Ant 1.6.5, 1.7.0, 1.7.1, and 1.8.1, so I do not think it is a regression. A simple Ant target run on this workspace (pure Ant, no Hudson) couldn't scan the directory until it had a 1400m heap, and even then it took 96 minutes because it was still memory constrained.

The pattern being used was fairly innocent looking:
koko/dev/projects/**/target/test-reports/TEST*.xml

Except it isn't that innocent when the root workspace dir has 300k+ files, and 1,000,000+ directories in it, with some very long path names on top of it. As it turns out, that ** is expensive and should be used with great care. Even though it is only traversing our code projects, it is still a very large directory tree and it runs out of memory.

More important than the workspace-wide directory count, most of which should be skipped thanks to the fixed start of the pattern, is the count within /projects/ itself: still almost 300k directories.

Maybe this is a newbie mistake with Ant...but I'm not so sure. There are times where ** may be necessary, and I have a newfound awareness of how expensive this can be (in time taken, obviously, but more importantly in memory - it's no good if it fails because it runs out of heap...)



For a smaller directory tree on my dev machine, I wrote a short Ant target that would touch 1 file, found using an include pattern similar to the above. The ** version took 24 seconds to execute on an Intel X25-M 160GB (G2) SSD. The * version took 0 seconds. The time penalty of ** would be much larger on slower, traditional hard drives...
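To make the difference concrete, here is a rough Python sketch. It is not Ant's DirectoryScanner and does not reproduce its memory behavior; it is a simplified model with made-up function names and a made-up sample tree. It only counts how many directories have to be looked at when the wildcard can match any depth (like **) versus exactly one level (like *).

```python
import os
import tempfile


def count_dirs_visited_recursive(root):
    """Model of a '**' match: every directory under root has to be walked."""
    visited = 0
    for _dirpath, _dirnames, _filenames in os.walk(root):
        visited += 1
    return visited


def count_dirs_visited_single_level(root):
    """Model of a '*' match: only root and its immediate child directories count."""
    visited = 1  # root itself
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                visited += 1
    return visited


def build_sample_tree(root, projects=3, depth=4):
    """Create a small hypothetical workspace: projN/target/test-reports plus a deep src tree."""
    for p in range(projects):
        project = os.path.join(root, f"proj{p}")
        os.makedirs(os.path.join(project, "target", "test-reports"))
        deep = os.path.join(project, "src")
        for d in range(depth):
            deep = os.path.join(deep, f"level{d}")
        os.makedirs(deep)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_sample_tree(tmp)
        print("** style directories visited:", count_dirs_visited_recursive(tmp))
        print("*  style directories visited:", count_dirs_visited_single_level(tmp))
```

The recursive count grows with the size of the whole tree, while the single-level count grows only with the number of projects. On a tree with hundreds of thousands of directories, that gap is where the time (and, in Ant's case, the heap) goes.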

Bear in mind, for Hudson users - Findbugs, Warnings, and other plugins typically use Ant include patterns (and I am sure Ant's DirectoryScanner), and in those cases sometimes ** is unavoidable.

I need to fire up a Hudson workspace and ensure that it supports multiple includes, so users (me!) can avoid ** in more cases...

Monday, August 23, 2010

Don't Filter Without Being Aware of the Scunthorpe Problem

I just happened to stumble across this question today while browsing StackOverflow. In it, a user is searching for a library that will filter profanity. Some suggestions are posed - as well as a problem with one of them. The Scunthorpe Problem. The gist of it is that naive filtering implementations will often filter innocent words that contain sequences of letters that are considered dirty by themselves.

I highly recommend reading the full Wikipedia page - it is very entertaining.

If you ever find you need to implement a profanity filter - make sure you don't forget about the Scunthorpe problem, or you may cause your users unnecessary strife.

My favorite word resulting from the Scunthorpe problem: buttbuttinate.
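For the curious, here is a minimal Python sketch of how a word like that comes about. The one-entry replacement table and both function names are made up for illustration; this is not taken from any particular filtering library. The naive version replaces a listed string wherever it appears, while the second version only replaces it when it stands alone as a whole word.

```python
import re

# Hypothetical replacement table, chosen only to reproduce the example word.
REPLACEMENTS = {"ass": "butt"}


def naive_filter(text):
    """Replace every occurrence of a listed string, even inside longer words."""
    for bad, clean in REPLACEMENTS.items():
        text = text.replace(bad, clean)
    return text


def whole_word_filter(text):
    """Replace a listed string only when it appears as a whole word."""
    for bad, clean in REPLACEMENTS.items():
        text = re.sub(rf"\b{re.escape(bad)}\b", clean, text)
    return text


if __name__ == "__main__":
    print(naive_filter("assassinate"))
    print(whole_word_filter("assassinate"))
    print(whole_word_filter("a pain in the ass"))
```

Whole-word matching avoids this particular failure, but it is not a complete answer: it still ignores case, deliberate misspellings, and place names like Scunthorpe that a stricter list might trip over.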

Thursday, August 5, 2010

More on e-reading, and the Pragmatic Bookshelf

I did some ranting the other day about my first experiment with e-reading, but afterwards I learned something cool to offset the potential downer. I had completely forgotten that I have PDF versions of lots of books from the Pragmatic Bookshelf.

I love the Pragmatic Bookshelf. Not only is it full of awesome books, but books I bought 5 years ago in print+pdf I can now convert and put on my iPad at no additional charge. This is the definition of sweet. All I had to do was log into my account, tell the gerbils which books I wanted in which formats, and wait a minute. There are a couple of old ones that are still only available in PDF - but I can still view those on my iPad, it just isn't as convenient yet. The Pickaxe is a notable example...but I'd say more than 75% of my books had e-reader formats available. Again, iOS 4 for iPad can't come soon enough - in this case for iBooks PDF reading.

At a bare minimum, I know that I can enjoy a good set of books on my iPad going forward. Exciting. Especially since there are still a couple I haven't read yet, and some I intend to re-read soon. There are countless more that I desire...

It's high time I create a bookshelf on this site, come to think of it... I'll put one up in the next week.

Tuesday, August 3, 2010

Adventures in E-reading

I read my first book and a half on the Kindle reader for iPad last weekend.

Specifically I enjoyed:
  • Masters of Doom - a great book primarily about John Carmack and John Romero. If you are a fan of the genre, it's a great read. I stayed up past 4am reading, it was so engrossing. I remember being on the Software Creations BBS, which is mentioned in the book.
  • Peopleware - Productive Projects and Teams (Second Edition) - a highly recommended book. I'm only halfway done, so a full write-up will have to wait, but it is a very insightful (and also quick) read.
Why did I choose the Kindle reader over iBooks? Kindle has a much better selection of books. So far iBooks is 0 for 3 on books I have wanted. Also - the Kindle reader is available on other platforms.

I love reading on the iPad. I can read with a light on - or without. The text was very readable and my eyes did not get put off by the backlit screen even after reading for 5 hours straight as I tore through Masters of Doom. Even though I already owned Peopleware in print, I paid the 10 bucks for the Kindle version anyway. That's a testament to how much I enjoyed reading on the iPad.

All is not well in the land of e-reading though. While I had no quality issues with Masters of Doom, Peopleware seems to be full of missing punctuation, at least one blatant typo, and a chapter that is in the wrong place. I did a quick search - apparently I am not the only one. Since I have the print version, I have verified that none of the issues I have seen occur in my print copy. What is going on? Could it be that physical books are being OCR'd, or worse, re-typed by hand? What on earth??? Shouldn't digital copies already exist at the publisher?

Here is an example from the Table of Contents in Peopleware:
[Image: actual TOC from the paperback copy]
[Image: Kindle version]
The chapter appears in the wrong place in the book. Intermezzo should be between chapters 9 and 10, not between chapters 8 and 9. Also, less importantly, the titles seem to have been truncated, and in the case of #8, the quotes are missing.

I have found several instances of missing punctuation while reading so far, as well as one glaring typo:
"The only acceptable interruption there was a fire alarm, and it had to be for a real Tire."
Somehow a lowercase f turned into a capital T...

Maybe some more veteran e-readers can tell me if they run into this a lot. I find it seriously distracting when sentences are incomplete or I find typos in books. Especially if they are an artifact of the e-book translation, and not something the original editors missed.

Is this the state of ebooks these days? I hope not, or my adventures in e-reading will be short lived.

Monday, August 2, 2010

Stage. Stage. Stage. Always Stage.

It doesn't matter what software we're talking about.
  • MySQL
  • Hudson
  • Java releases
  • third-party libraries
  • Subversion client
  • you name it - if it can be upgraded or replaced, it can and needs to be staged.
Always stage. ALWAYS.

If you don't stage, you're just punishing yourself (and potentially your coworkers and/or users). So be smart - always stage your upgrades. You'll get burned if you don't. It's just Murphy's Law applied to technology, really.

This isn't in response to recently getting burned myself; this post has been brewing for a while.

Many years ago, a friend of mine once wondered why an admin wouldn't upgrade PINE (a popular terminal email client at the time) as soon as it came out. Now I know why - because that admin was wise. They wanted to be sure it worked and didn't cause any regressions before subjecting their users to it.

I am continually surprised not just by how many things can go wrong, but also by how frequently they do, despite the best efforts and intentions of developers. Just the other week, the tiniest name change by Oracle in JDK 1.6.0 Update 21 caused Eclipse to fail to launch. Who'd have thought? And yet it happened. There are numerous other examples. I am sure you can think of plenty in whatever software you use. I can certainly think of several, just in the past couple of months.

Before your whole office upgrades to the latest version of Visual Studio, or the latest Subversion client, or any other software, make sure someone tests it out first.

Stage, because anything that can go wrong will go wrong.

Friday, July 30, 2010

iPad Thoughts

I bought an iPad in early May. This post is a summary of my thoughts from when I first got it up until now, 3 months later.

Opening Thoughts and The First Month

First off - you really need to use an iPad to appreciate its coolness. It is incredibly slick.

I found immediately after getting an iPad that I almost never needed to turn on my desktop at home. I also didn't really need to turn on my laptop for much. It satisfies most browsing, email, RSS, and other media needs. The battery life is incredible, especially for watching video. You can read books from iTunes, Kindle, and Nook stores. There are lots of fun games for it. It's a great web browser.

It has really changed how I consume media. Let's not forget it also does comic books - although the selection is still not excellent, I have enjoyed one or two series that way.

My iPad is my first choice entertainment device wherever I go. It is also my note taking device in meetings at work. It's handy for note taking + browsing our issue tracker + continuous integration + email all in one. It rocks.

Favorite Apps

Below are some of my favorite apps.
  • The Early Edition
  • Twitterrific
  • Comixology
  • Instapaper
  • Netflix 
  • ABC Player
  • We Rule 
  • Cogs HD
  • Plants vs Zombies HD
  • Leap Sheep!


Problems

The iPad is a device I use every day, but there are a couple of clear annoyances popping up.

  1. The iPad needs the iOS 4 update BADLY. It feels like a second-class citizen now that the iPhone 4 is out and has iOS 4 and multitasking. Apple had really better have some extra features up their sleeve, or the wait just is not justified (and is crummy for early adopters).
  2. It needs Flash. As much as I dislike Flash, I still hit websites that require it. Some sites even have a mix of videos, half of which are HTML5 while the other half appear to be Flash and won't play. This is an issue for both Apple and Adobe to sort out. Adobe, because I still don't think they have a good, working Flash implementation out on smartphones (although as I write this, Froyo is potentially finally out on some phones). Also - where's Linux Flash support these days? 64-bit, anyone? You can't claim it "just works" when basic platforms have had problems for YEARS.

Summary

The iPad is an amazing device. Considering it is not only a first generation device, but also the first of its class - a true tablet with great battery life for browsing, reading, and watching video - I am definitely impressed. It isn't perfect - but iOS 4 should remedy my biggest issue, and it can't come out soon enough. Being able to multitask on my phone but not my iPad just isn't right - it should have launched with it. My hope is that printer support and/or more sophisticated multitasking are coming. Time will tell...

Tuesday, July 13, 2010

Browser Benchmarks - July 8, 2010 - Ready. Set. Fight!

Update: More recent benchmark

Yesterday my Opera browser at work auto-updated to 10.60. Opera's auto-update is finally coming in line with Firefox's...still not quite as nice, but it's getting there. More interesting was the dev blog post about 10.60. I also saw elsewhere that Firefox 4.0 beta 1 was out, and checked up on IE9 to see how things were progressing. Somewhere along the way, I found the Peacekeeper benchmark and decided I wanted to benchmark some browsers to see what the current state of play is. I will be using Peacekeeper for overall (HTML5 / DOM / JavaScript) performance, and SunSpider for pure JavaScript performance. I'm also throwing Acid3 in there for a sense of where each browser is standards-wise.


Environment:
System: Intel Core i5 750 @ 3.36GHz, 4GB RAM, 80GB Intel SSD (G1), ATI Radeon HD 4850
OS: Windows 7 64-bit Home Premium
Fresh boot. No other apps or system tray programs running aside from Microsoft Security Essentials.
Each browser was run by itself, with only one tab for the benchmark itself.

First off, I'll start with SunSpider. These are the final numbers (lower is better), but I have linked the full results for each. I don't have any fancy graphs, so I will order them fastest to slowest.

  1. Chrome 5.0.375.99 - 224.0ms +/- 2.0% [Full Result]
  2. Opera 10.60 final - 231.4ms +/- 1.5% [Full Result]
  3. Safari 5.33.16.0 - 273.4ms +/- 2.3% [Full Result]
  4. IE 9 Preview 3 (1.9.7874.6000) - 293.6ms +/- 0.7% [Full Result]
  5. Firefox 4.0 Beta 1 - 406.8ms +/- 1.9% [Full Result]
  6. Firefox 3.6.6 - 575.6ms +/- 1.1% [Full Result]
  7. IE 8 - 3555.8ms +/- 0.6% [Full Result]
Chrome has the lead on pure JavaScript performance with its V8 JavaScript engine, but Opera is not far behind. It's only trailing by 3.3%, barely more than the margin of error. Next up is Safari, taking 22% longer. The IE9 preview is showing great promise, taking 31% longer than Chrome.

Firefox 4 beta is not in the same ballpark currently, at just more than half the speed of Chrome. It is a good improvement over Firefox 3.6.6, which takes about 2.6x as long as Chrome, but it is still a ways off. The big takeaway is that the IE9 team now appears to be ahead of the Firefox 4 team on JavaScript performance...times are certainly changing.

IE8 is included as a baseline, and because a lot of users are still on IE.
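If you want to check the percentages, they fall straight out of the SunSpider totals listed above. Here is a small Python sketch of the arithmetic; the dictionary and helper names are just ones I picked for illustration, and the numbers are copied from the list.

```python
# SunSpider totals in milliseconds, copied from the results list (lower is better).
sunspider_ms = {
    "Chrome 5.0.375.99": 224.0,
    "Opera 10.60 final": 231.4,
    "Safari 5": 273.4,
    "IE 9 Preview 3": 293.6,
    "Firefox 4.0 Beta 1": 406.8,
    "Firefox 3.6.6": 575.6,
    "IE 8": 3555.8,
}


def percent_longer(ms, baseline_ms):
    """How much longer a run took than the baseline, as a percentage."""
    return (ms / baseline_ms - 1.0) * 100.0


def relative_speed(ms, baseline_ms):
    """Speed relative to the baseline (1.0 means equally fast, 0.5 means half as fast)."""
    return baseline_ms / ms


if __name__ == "__main__":
    baseline = sunspider_ms["Chrome 5.0.375.99"]
    for browser, ms in sunspider_ms.items():
        print(
            f"{browser}: {ms / baseline:.2f}x Chrome's time, "
            f"{percent_longer(ms, baseline):.1f}% longer, "
            f"{relative_speed(ms, baseline):.2f}x Chrome's speed"
        )
```

Taking Chrome as the baseline is my choice here simply because it was the fastest; any other browser would work as the reference point.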

Next up is Peacekeeper. For this I have screenshots and the full results here.



Opera 10.60 leads here overall. Chrome is close, with a 16% lower score. It's worth noting just how far ahead of every other browser Chrome and Opera are in the overall category. Safari 5 is next, at less than half the score. Firefox 4b1 is on the heels of Safari, and then there are the rest.

What's interesting to note here is that even though Safari and Chrome are both WebKit-based browsers, the Chrome team is clearly going the extra mile on performance.

I want to drill into what makes up these scores, as this benchmark is new to me, but I think that will be in a follow-up post, as this post is already almost a week late if you look at the date.


Finally, let's look at Acid3.

  • Chrome 5.0.375.99 - 100/100
  • Opera 10.60 - 100/100
  • Safari 5 - 100/100
  • Firefox 4 Beta 1 - 97/100
  • Firefox 3.6.6 - 94/100
  • IE 9 Preview 3 - 83/100
  • IE 8 - 20/100 (FAIL)

Chrome, Opera, and Safari all receive full marks. I did not compare pixel to pixel, but they have had good track records with Acid tests. Firefox 4 Beta 1 is getting close to passing, slightly better than Firefox 3.6.6. IE9 has come a long way from IE8 but still has a ways to go in order to pass Acid3.

Regardless of your preferred browser, there is some intense competition now in JavaScript engines and general rendering performance, resulting in the experience improving for everyone. It is an exciting time, and I don't think you can go wrong with Chrome, Firefox, Opera, or Safari. Rendering and performance are more than good enough in all of them, so it comes down to other usability / features. Once IE9 is closer to final, I may even be able to recommend that.
