Monday, September 18, 2006

Performance is not optional

Interesting read here on languages and choice, and the conclusion is even right (the ability to mix and match languages, with proper interop, will be key - Frankenstein solutions need not apply). More on that some other time, but referenced from that blog, I saw this, which did some bashing comparisons of C/C++, Python and Ruby performance. My curiosity got the better of me ... I had to try my Smalltalk implementation (VisualAge Smalltalk) to compare.

The author's Ruby benchmark is shown here (testlist.rb). On the author's machine, it completes in "~100 secs". In comparison, the author's C++ code took ~7 seconds, and you can read his entry for his conclusions on the suitability of Ruby or Python for some applications. So, to set a baseline on my own hardware, I first ran the Ruby code on my laptop and it completed in 111 seconds, which for this purpose will be considered close enough to ~100 seconds to claim that our machines are roughly equivalent (I do know better, but for this purpose ... I declare it "close enough for government work"), so I just squint a little and can do a direct comparison to the author's C++ numbers.

I translated the Ruby code to Smalltalk, trying to keep the code as close as possible to the Ruby version. Here it is.

| l start stop |
l := OrderedCollection new.
start := Time millisecondClockValue.
1 to: 5000000 do: [:x |
    0 to: 9 do: [:i | l addFirst: i].
    [l isEmpty] whileFalse: [l removeFirst]].
stop := Time millisecondClockValue.
stop - start

On the same machine it runs in (drum roll please ...) 7.2 seconds which is umm, "approximately" 7 seconds and comparable to the optimized C++ numbers quoted by the author. (For complete truth in advertising, if you use different permutations of #addLast, #removeLast, the times can go up to about 9 seconds).

What does this tell me? Clearly, Smalltalk is as fast as C++? Wooo hooo ... sure, and sometimes faster, and ok, sometimes slower too. What does it really show? That Ruby and Python have a way to go performance-wise, and that if you apply modern implementation techniques like the ones used in commercial (and some open source) Smalltalks, improved performance is well within reach. Using a Smalltalk VM for Ruby has been discussed here, and with the release of Strongtalk, this is a possibility.

Anyone want to guess what the numbers for the same benchmark are in Java? (Hint - think linked list of primitive types).

So, despite some of the crazy rhetoric you have to wade through these days, one of the nice things that has happened as part of the recent rise in interest in dynamic languages is these performance comparisons, which in turn are leading to a rediscovery of the "old ways" :). This can only be good for all. Remember kids, "performance is not optional" (and neither is interop).


Blogger Rick DeNatale said...

Hello John!

I just found your blog via Pat Mueller's Planet OTI. It's good to see so many OTIers blogging.

This article got me off my butt about writing about Ruby performance on my Ruby-oriented blog.

Here's the result, which is quite frankly riffing on your theme.

3:58 PM  
Blogger adas said...

Here are Java results:

Sun (Java HotSpot(TM) Client VM (build 1.5.0-b64, mixed mode)): 19 sec

IBM J9 (IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Linux x86-32 j9vmxi3223ifx-20060124 (JIT enabled))): 91 sec

BEA JRockit(R) (build R26.0.0-189-53463-1.5.0_04-20051122-2040-linux-ia32): 73 sec

Here's the code
long start = System.currentTimeMillis();
LinkedList l = new LinkedList();
for (int x = 0; x <= 5000000; x++) {
    for (int i = 0; i <= 9; i++) {
        l.addFirst(i);
    }
}
while (!l.isEmpty()) { l.removeFirst(); }
long stop = System.currentTimeMillis();

3:22 PM  
Blogger John Duimovich said...

hey adas, thanks for the Java code. Kind of surprised me, because I actually had different numbers in hand with some code of my own.
Then I noticed your code was translated incorrectly: you have it adding 50 million elements and
then removing them all. When I ran it on HotSpot I got an out of memory ... that clued me in :) ...

Anyhow, the original benchmark adds 10 elements, removes them all, and does that 5 million times.

Here is the Ruby code that was initially used.

l = []
start = Time.now
for x in (1..5000000)
    for i in (0..9)
        l.insert(0, i)
    end
    while !l.empty?
        l.pop
    end
end
stop = Time.now
puts stop - start

Amazing how a misplaced bracket changes things :)

Anyhow, I include the code I wrote below and my results. I ran on Windows, so feel free to rerun my version on Linux if you want (I doubt we'll see a difference).

Originally, I wanted to experiment a little so I wrote the bench to allow use of ArrayList or LinkedList. Using an interface does cost you (see times) so I also added a concrete class like in your version and re-ran it.

so, results here ...

Java HotSpot(TM) Client VM (build 1.5.0-b64, mixed mode)
LinkedList Time: 4316
List Interface - LinkedList Time: 7481
List Interface, ArrayList Time: 9193

IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Windows XP x86-32 j9vmwi3223-20051103 (JIT enabled)
LinkedList Time: 3265
List Interface - LinkedList Time: 4556
List Interface, ArrayList Time: 6319

--- file ---

import java.util.List;
import java.util.LinkedList;
import java.util.ArrayList;

public class B1 {
    static public void main(String args[]) {
        long start, stop;

        start = System.currentTimeMillis();
        LinkedList l = new LinkedList();
        for (int x = 0; x <= 5000000; x++) {
            test(l);
        }
        stop = System.currentTimeMillis();
        System.out.println("LinkedList Time: " + (stop - start));

        start = System.currentTimeMillis();
        List list = new LinkedList();
        for (int x = 0; x <= 5000000; x++) {
            test(list);
        }
        stop = System.currentTimeMillis();
        System.out.println("List Interface - LinkedList Time: " + (stop - start));

        start = System.currentTimeMillis();
        list = new ArrayList();
        for (int x = 0; x <= 5000000; x++) {
            test(list);
        }
        stop = System.currentTimeMillis();
        System.out.println("List Interface, ArrayList Time: " + (stop - start));
    }

    public static void test(List l) {
        for (int i = 0; i <= 9; i++) {
            l.add(0, i);
        }
        while (!l.isEmpty()) {
            l.remove(0);
        }
    }

    public static void test(LinkedList l) {
        for (int i = 0; i <= 9; i++) {
            l.addFirst(i);
        }
        while (!l.isEmpty()) {
            l.removeFirst();
        }
    }
}

9:33 PM  
Blogger adas said...

Thanks for noticing the typo, John.
I ran all 3 VMs with -Xmx2048M

Still strange that I saw such diffs in time. I ran the same code on all 3 VMs

11:30 PM  
Blogger John Duimovich said...

adas ... yeah, my laptop *only* has 1 gig (can I really say *only* in reference to a gig?), so your version of the benchmark is a 2-gig-required disk test for me :)

brian ... agreed, the bench is not particularly fair or even an accurate portrayal of the kind of performance you will get in your app. Benchmarks are tricky this way, and micro benchmarks like this one are usually not going to represent your app's performance. I was kind of worried when I posted the initial Smalltalk version; basically I wanted to see how the Smalltalk VM we had built roughly 10 years ago (and I still use today) did against Ruby etc. Pretty well, I say. I didn't want to, nor will I, claim this is the be-all and end-all of benchmarks. It basically is what it is. And, as you note, the various implementation details across libraries will skew the results. Probably the most important thing to note from this thread is that algorithms matter; as you note, O(n) vs O(1) is gonna make more difference than language implementation.
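That algorithms-matter point can be seen within Java itself (a sketch of my own, not code from the thread; the class and method names are invented): the same drain-from-the-head loop is O(1) per removal on a LinkedList but O(n) on an ArrayList, so one language and one VM can differ wildly on the same loop.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class HeadRemoval {
    // Fill a list with n elements, then drain it from the front.
    // LinkedList.remove(0) unlinks the head: O(1) per removal.
    // ArrayList.remove(0) shifts the whole tail left: O(n) per removal.
    static long drainFromHead(List<Integer> l, int n) {
        for (int i = 0; i < n; i++) l.add(i);
        long start = System.nanoTime();
        while (!l.isEmpty()) l.remove(0);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 50000;
        System.out.println("LinkedList ns: " + drainFromHead(new LinkedList<Integer>(), n));
        System.out.println("ArrayList  ns: " + drainFromHead(new ArrayList<Integer>(), n));
    }
}
```

Same source-level loop; the quadratic drain dominates for the ArrayList case as n grows.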

Besides, we spent all our old days in Smalltalk optimizing recursive fibonacci :)

7:25 PM  
Blogger skrishnamachari said...

On VisualWorks 7.4NC, with the same VisualAge code, I get 5050 ms. Impressive compared to the others ..

8:14 AM  
Blogger John Duimovich said...


Thanks for running it in vw7.4NC. 5 seconds sounds pretty good.

Did you happen to run the Ruby numbers on the same machine too? To compare your 5-second number, you'd need a common reference on the same machine to gauge its relative speed. I used Ruby because the original bench had Ruby numbers (they were 111 seconds on my machine, which told me my machine was within about 10 percent of the machine Christopher used), so the numbers could be compared.

For the record, I'm on a Pentium 1.8GHz with 1G RAM, Windows XP Pro.

What class of machine did you use ?

9:29 AM  
Blogger skrishnamachari said...

My Machine: Dell Inspiron E1705 Duo Pentium 1GB RAM

VW: 5.050 s
IdlePython: 76.8274772352 s
FreeRide Ruby: 60.78 s
Eclipse Java: 18.640 s

The code is quite possibly different in each implementation, and in the sense that it may not use the optimal class/method, it cannot be termed an apples-to-apples comparison .. but the order of magnitude holds ..

4:19 PM  
Blogger John Duimovich said...

Sweet machine.

From the Ruby numbers, looks like your machine is about 40% faster than my hardware.

I won't be so presumptuous as to scale my 7s VisualAge numbers :), but your point is well aligned with the one I made - the commercial Smalltalks (VisualAge, VisualWorks) are pretty fast and they do pretty well. As you point out, we should not read too much into the actual values, apples to apples and all that, but as an indicator of order of magnitude, yep, good data.

The good news for Ruby is that these implementation techniques will work for it as well, so I can see Ruby implementations improving in the future. If people are happy with Ruby perf now, then they will be very happy as these kinds of performance improvements make it into Ruby releases.

9:32 PM  
Blogger Isaac Gouy said...

... one of the nice things that has happened as part of the recent rise in interest in dynamic languages are these performance comparisons ...

The Computer Language Shootout Benchmarks include several Smalltalk implementations as well as Ruby and YARV.

Besides, we spent all our old days in Smalltalk optimizing recursive fibonacci :)
Maybe you'd like to try again ;-)

The Smalltalk recursive programs seem much worse than Smalltalk's performance on the other benchmarks.

3:09 PM  
Blogger John Duimovich said...


Thanks for the link to the computer language shootout site. I've seen the site before but good to have a reminder.

"try again" ... good plan... would be more fun than my every day routine.

8:47 PM  
Blogger James McGovern said...

Could you in your next blog entry shed some light on Ruby on Rails and its security model?

3:37 PM  
Blogger Rick DeNatale said...

I've been looking at this from the point of view of how Ruby implements Arrays, and I noticed something which might or might not be important.

Besides using an Array rather than a linked list, the Baus Ruby benchmark doesn't even do the same thing as the C++ benchmark. In C++ the list is being used as a stack: items are added to the head of the list and removed from the head.

In the case of the Ruby benchmark, it inserts elements at the front of the Array using insert(0, obj), then removes them using pop, which actually removes the last element. So instead of a stack we have a queue.

The Python code has a variation on the same problem: it uses append, which adds to the end of the list, but del(0), which removes the first element.

Now I know that in benchmarking only performance matters, so I guess the old adage "make it work right before you make it work fast" doesn't necessarily apply.

Having said all that, another aspect of this is that the benchmarks take a rather naive approach to using the classes. For example, using Array#insert in Ruby is overkill for adding to the beginning or end of an array. More targeted methods such as push and unshift exist, which are at least marginally faster because they have fewer cases to consider.

I've got a few more thoughts on this over on Talk Like a Duck.

12:52 PM  
Blogger paul said...

The Java benchmarks are pretty flawed here. You've successfully timed the JVM's dynamic compilation along with the benchmark.

If you were really looking for a reasonable test, you could warm up the VM first and make sure the -server option is on.
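A warmed-up variant of the harness along these lines might look like this (a sketch of mine, not code from the thread; the class and method names are invented): run the workload once untimed so the JIT compiles the hot methods, then time a second run.

```java
import java.util.LinkedList;

public class WarmBench {
    // The workload under test: add 10 elements, drain them, repeat.
    static void run(int reps) {
        LinkedList<Integer> l = new LinkedList<Integer>();
        for (int x = 0; x < reps; x++) {
            for (int i = 0; i <= 9; i++) l.addFirst(i);
            while (!l.isEmpty()) l.removeFirst();
        }
    }

    static long timedRun(int reps) {
        long start = System.currentTimeMillis();
        run(reps);
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        run(100000);  // untimed warmup: let the JIT compile the hot loop first
        System.out.println("Warm time (ms): " + timedRun(5000000));
    }
}
```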

Secondly, using the standard linked list is quite easy, but it would take only about a 15-line inner class to make a primitive-based linked list.

That would of course save several million object creations. Overall, this Java code was set up to fail.
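Sketching that suggestion (my code, not Paul's; all names are invented): a singly linked list whose nodes hold a primitive int directly, so no Integer box is allocated per element.

```java
public class IntStack {
    // Each node holds the primitive int directly: one allocation per element,
    // versus a node plus a boxed Integer for java.util.LinkedList<Integer>.
    static final class Node {
        final int value;
        final Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    private Node head;

    public void addFirst(int v) { head = new Node(v, head); }
    public boolean isEmpty()    { return head == null; }

    // Assumes the list is non-empty, as LinkedList.removeFirst effectively does.
    public int removeFirst() {
        int v = head.value;
        head = head.next;
        return v;
    }
}
```

Dropped into the benchmark loop, it is used exactly like LinkedList's addFirst/isEmpty/removeFirst, just with ints.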

9:45 PM  
Blogger John Duimovich said...

Paul Tyma ...

heh. Of course this is a crappy benchmark - let's discuss why (and why not) ...

It's not because we didn't warm up the machine once and then run a second loop so that we could avoid timing the dynamic compilation. Compilation time is a fact of life, and users of a JVM will pay the price whether you measure it or not. Actually, skipping measuring it is kind of like cheating, but you already know that.

Also, customers want to know how fast something runs the first time, without warmup, because they typically do not want to run things twice ... kind of defeats the purpose. When they call us up for performance reasons, we don't tell 'em to run it again and measure it then. They eat the compilation time ... so should we.
(Note - benchmark publications are different, and a set of rules ensures common, comparable results - typically allowing a warmup).

and besides ...

The JVM should be able to figure out (and J9 does) what's hot in this benchmark, and it does optimize, and it does make the overall time significantly better ... making excuses for a VM not being warm is just silly.

FYI, the -server option is a hack for VMs which can't figure out automatically what kind of workload you have.

Now, to your recommendation of writing your own linked list! Oh, please - sure, anyone could do this, but the only reason you need to is that Java generics are fake (type erasure) and as such don't apply to primitives. Then they fake you out with autoboxing, and innocent-looking code results in the millions of allocations you refer to. Yes, Java has some suckage there, but the answer should not be "write your own version of linked list", regardless of how easy it is.
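To illustrate that autoboxing (a sketch of mine; boxedToSameObject is an invented helper): adding an int to a LinkedList&lt;Integer&gt; compiles to an Integer.valueOf call, a hidden allocation for values outside the small cache valueOf keeps for -128..127.

```java
import java.util.LinkedList;

public class BoxingDemo {
    // Add the same int twice and check whether the two boxed values are
    // the same object. The compiler inserts Integer.valueOf on each add.
    static boolean boxedToSameObject(int v) {
        LinkedList<Integer> l = new LinkedList<Integer>();
        l.addFirst(v);
        l.addFirst(v);
        return l.removeFirst() == l.removeFirst();  // reference comparison on the boxes
    }

    public static void main(String[] args) {
        // Small values hit the Integer cache, so both adds box to one shared object.
        System.out.println(boxedToSameObject(5));
        // Larger values typically box to a distinct object on every add:
        // in the 50-million-element loop, that's tens of millions of allocations.
        System.out.println(boxedToSameObject(100000));
    }
}
```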

Anyhow, benchmarks are not applications. These numbers are just that: one set of numbers, and they point out some things about Java that you may not like to hear. You will note the original point was comparing Ruby and Smalltalk and showing what is possible in a dynamic language.

11:48 PM  
