
Most Computerlike Player Ever

By Mig on October 30, 2006 12:45

Tags: computers, pseudo-philosophy, world champions

ChessBase has posted the latest attempt to reduce chess to the level of computers. Two Slovenians have produced epic statistics in order to answer "who was the strongest player ever?" (Full original paper in PDF here.) While many of the stats and how they were generated are interesting, and occasionally revealing, it doesn't have a great deal to do with answering that question. It's actually quite a bit more interesting than I thought it was going to be at the start, although there are a few key failings. Not of methodology, just the limitations of computers and statistics analyzing a human game. I was hoping to have time to put together something comprehensive before posting about it, but it's a busy week already. I'll post an article on ChessBase in a day or two when I have time to finish my own in-depth analysis statistics package...

Meanwhile, what this survey does provide information on is which players preferred tactical complexity. Of course the more aggressively and tactically you play the more mistakes you are going to make. Also "mistakes" according to the computer. Lasker, Tal, Alekhine, and to a lesser extent Kasparov, all believed that putting their opponents under pressure was worth at least a pawn and they backed this up consistently. The authors attempt to correct for complexity and style by seeing which players were the most accurate in complex positions (as judged by computer) and that's a fair try. But you can't correct for style and the psychological pressure of facing a Tal sacrifice. But in this survey, results don't matter, and that's saying a lot.

The tempting complexity analysis is actually a sort of trap. It's not just reaching complex positions or how computer-accurately you play when you get them. It's what you want to do when you get there. Some players want to increase the complexity, to increase the dynamic elements and risk and attack. Others are more prone to wanting to keep things under control, to exchange material and/or limit the range of possible mistakes (for both sides). So while it comes only as confirmation that Capablanca was a player of phenomenal accuracy, it also confirms he much preferred clear positions where his positional mastery and technique could win without risking loss. It's also obvious which type of player is going to score better with the computer.

I suppose this is all just a way of saying players play to their strengths and a computer can't help much there. Tal wasn't inaccurate, Tal was Tal. He did what gave him the best chances to win and Capa did the same. Players of world championship level are intimately acquainted with their strengths and weaknesses (Kramnik's flirtation with e4 notwithstanding, although that coincided with the worst of his health issues). This is why results always matter.

I'm sure there were many contemporary players who would score better on this exam than Lasker, Tal and probably even Kasparov (who was tactical but not really a speculative player). It would be an interesting control group to take the three or four top non-champs around each champ and see how they stack up. Andersson and Karpov, for example, or Korchnoi. I don't doubt we could come up with two or three players from each era who would come out ahead of the champions. This seems like an essential element of scientific method since comparing across eras is notoriously tricky, as the authors admit. So, who will it be? Nominate at least one "accurate" non-champ to be compared to the champ of his era. Jussupow for Kasparov? Schlechter for Lasker?

Of course quality of moves is also critical, although as long as it's the same evaluation across the board it's a little beside the point. But the assumption that a program (Crafty, Rybka, any) spending a few seconds on each move is at the level of world champion play is dangerous and degrading. Many simply assume that computer=God and any deviance from the computer suggestion is de facto error. A scary thought. The authors actually didn't use time per move, but a fixed search depth. This is "objective" but horrific in terms of quality. It's like evaluating the value of money based on the number of colors in the bill.

As a footnote, the abused Steinitz played in the romantic era for most of his life and of course his games aren't going to stand up well in the blundercheck. If they only used his games from, say, 1886 onward he would be back in the pack, if still at the bottom. Chess quality is a shoulders of giants situation and has also benefited from decades of increasing professionalization. 70 years ago only a handful of players were working on chess full time. Now you have the entire field of Cap d'Agde studying seven hours a day as teenagers. That said, it's not as if the middlegame transition to endgame and endgame play of Rubinstein and Capablanca is going to be surpassed. Aspects of chess are practically finite and therefore room for improvement in those aspects is finite.

61 Comments

Charles Milton Ling | October 30, 2006 3:34 PM

Good comments, Mig.
And I do wonder whether Kramnik will ever play 1. e4 again.

Yuriy Kleyner | October 30, 2006 3:40 PM

We know a lot about improvements in openings. Every time a new line is trotted out or an old line gets a refutation/improvement the subject matter is endlessly gone over in analyses we read.

My question is, what about endgame technique and middlegame strategies? Have those undergone a high degree of evolution? Were there noticeable leaps since the advent of computers, or just the changes in perception famous in history?

edu | October 30, 2006 3:41 PM

It seems to me that the authors of the article basically supposed that, if a computer says something, then it is true.

If my impression is correct, then the article has nothing to do with the question of "who was the best player". It really has to do with the question of "who was most similar to a computer".

Russianbear | October 30, 2006 3:51 PM

Like I posted in the other dirt, my opinion on this is here: http://www.chessninja.com/cgi-bin/ultimatebb.cgi?ubb=get_topic;f=8;t=000238;p=1#000004

Basically, I think they had a nice idea. Of course Tal will differ with Crafty more than Capablanca, because Tal sought to complicate things and Capablanca rarely saw an exchange he didn't like. But these guys have thought of that! They introduced the thing which we (people who already thought of doing this and/or did this - littlefish, myself, and others) didn't think of before: they calculate the complexity of a position. So we can now measure which World champions preferred complex positions and which didn't and we can measure who was best at complex positions and who was best at the simple ones.

However, the results do not make much sense, because they used Crafty which is a very weak engine - it is only slightly higher than 2600, and may be overrated. And Crafty has probably one of the most "unhuman" styles among the computer programs, too. But I believe if Rybka was used, the results would be much more reliable, since Rybka has a much more human approach to the game, and it is rated higher than Crafty by 300+(!) points. But of course they couldn't use Rybka, because it is commercial and the source code is not available. But like I suggested in another DD, perhaps the solution would be to let Rybka examine the moves and use the complexity values as calculated by Crafty - I think the results would be much more meaningful than simply using Crafty for both.

Mig | October 30, 2006 3:55 PM

I think the Great Predecessors series makes pretty good headway in demonstrating the advances in positional and middlegame play, at least up through Fischer. Of course best plans and such have been refined extensively since then, but Fischer was probably the first true and consistent modern. The next leap was computers and the discarding of dogma. It's almost anti-style today with the first generation of computer-trained teens in Nakamura, Carlsen, et al. Post-modern perhaps?

george | October 30, 2006 4:13 PM

OK, let's all talk about toilets now !! 8-))

Adrian | October 30, 2006 4:16 PM

Yes, I was puzzled by the authors' effort to dismiss the (to me) rather serious problem that almost all of the champions evaluated were much stronger than the computer program used as a benchmark. Their nonsensical reasoning:

"However, altogether more than 37,000 positions were evaluated and even if evaluations are not always perfect, for our analysis they just need to be sufficiently accurate on average since small occasional errors cancel out through statistical averaging."

The trouble of course is that if Crafty is truly only around 2620 strength, its evaluation errors are not "small and occasional" but frequent and sometimes rather severe. Ultimately, 2600 evaluations just can't be the baseline for comparing the strength of 2800 players. It would be like letting Alexander Beliavsky be the arbiter of whether Kasparov or Fischer was greater. It is a farce to pretend that one can identify "the Strongest Chess Player of all Time" this way.

george | October 30, 2006 4:17 PM

(that was a joke actually)

A friend of mine (good player) believes that computers will eventually kill chess. This story started with G. Kasparov I think (it's a matter of coincidence - could happen with another player).

Today, most of the magic has gone with the ability of programs to calculate almost everything.

Myself, since I stopped watching Fritz all the time, I enjoy chess more than before.

Russianbear | October 30, 2006 4:24 PM

Agreed. No offence to Belyavsky. But assuming Crafty is the gold standard makes this exercise not "who is the best world champion ever", but "which world champion played the most like a 2600-rated program", and thus may actually be a way to calculate the weakest world champion. Well, of course people who made hardly any blunders would tend to be close to the top, too, so it is not fair to say their lists are actually in reverse order. But of course it is not right to let a 2600 program decide these things. Obviously, if this is a methodology we use to decide who was the best ever (and I think the methodology is a sound one), we need a program that would be consistently better (higher rated, etc.) than ALL world champions. It is not clear if such a program even exists, but Rybka does sound like a good option.

Eo | October 30, 2006 4:45 PM

I would like to see some of the uber technicians like Portisch, Reshevsky, Flohr, Rubinstein, Averbakh, Reti and contrast them with the likes of Geller, Keres, Korchnoi, Anand, Topalov, Shirov and Morozevich.

al | October 30, 2006 4:46 PM

Yuriy,

John Watson's "Secrets of Modern Chess Strategy: Advances since Nimzowitsch" could be a good book to read to answer that. I found it a little dry - I find tactics and endings more fun :-)

al | October 30, 2006 4:50 PM

Computers are a tool, no more no less.

Even though a car can go faster than a human, athletics has still prospered. I can get from A to B faster by car, but it is often better for me to walk or run to keep the beer gut from expanding too much.

Likewise with chess, a computer is a useful tool, which may out-perform me, but it is no substitute for using your brain and the face to face struggle. I don't think my games are worthless because Fritz or whatever picks holes in my choices. However, I hope I can learn from the computer when analysing my games after playing them.

Marc Shepherd | October 30, 2006 4:50 PM

While Rybka may be the engine du jour, and it certainly plays stronger than Crafty, it is nevertheless a computer, and suffers from many of the same limitations as all engines. Many factors influence the outcome of human chess that an engine simply cannot consider, such as psychology, fatigue, pressure, playing style, the match situation, and the past history between two opponents.

The root flaw of the study is that it rates players based on their similarity to a computer, instead of rating them on their results. Replacing Crafty with Rybka wouldn't fundamentally change that.

"I do wonder whether Kramnik will ever play 1. e4 again."

Lest we forget, in Kramnik's must-win against Leko in Game 14 at Brissago, he played 1.e4.

Matej Guid | October 30, 2006 5:44 PM

Hi guys, I'm happy that the article initiated many interesting and constructive discussions around various forums and chess blogs. Indeed it would be interesting to see how using another (better) engine would affect the results. Frederic Friedel of Chessbase.com even suggested that we ask readers to extend the research so that we get a wider base of games. I think it really wouldn't be a bad idea to offer a website with appropriate tools so that the whole chess community would be able to contribute to "the final verdict", even if this approach reflects only one measure of the players' ability.

The solution you propose, Russianbear, unfortunately wouldn't be sufficient, since we need more data from the engine. I'll give below exact instructions about what would actually be required.

First, the search should be limited to a certain fixed depth d (plus quiescence search). Then, for each position, the program should iteratively search to depths from 2 to d and for each depth the following data needs to be obtained:
- best move (as suggested by computer) and its evaluation,
- second best move and its evaluation,
- evaluation of the move played.
Besides, for each position (from the first move on) the material state of each player should be given.
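The data requirements listed above could be captured in a record like the following. This is only a minimal sketch: the names `DepthResult`, `PositionRecord` and `collect_record` are hypothetical, and `analyse` stands in for whatever engine interface is actually available (it is assumed to return the best move, second-best move and the evaluation of the played move at one fixed depth).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class DepthResult:
    """What the engine reports for one position at one search depth."""
    best_move: str
    best_eval: float               # evaluation of the engine's best move
    second_move: Optional[str]     # None if only one legal move exists
    second_eval: Optional[float]
    played_eval: float             # evaluation of the move actually played


@dataclass
class PositionRecord:
    """Everything stored for one position of one game."""
    move_number: int
    played_move: str
    white_material: int            # material state of each player
    black_material: int
    by_depth: Dict[int, DepthResult] = field(default_factory=dict)


def collect_record(
    analyse: Callable[[str, int, str], DepthResult],  # hypothetical engine hook
    position: str,
    move_number: int,
    played_move: str,
    white_material: int,
    black_material: int,
    max_depth: int,
) -> PositionRecord:
    """Search iteratively to depths 2..max_depth and keep every result."""
    record = PositionRecord(move_number, played_move,
                            white_material, black_material)
    for depth in range(2, max_depth + 1):
        record.by_depth[depth] = analyse(position, depth, played_move)
    return record
```

Keeping the result for every depth, rather than only the final one, is what makes later calculations on how the engine's preference shifts with depth possible.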

However, that's not all. Once the analyses are obtained, the appropriate database should be set up and a whole set of scripts or programs should take care of all the required calculations. That's quite some work, but if anyone has the required time and feels ready to do the job well for the sake of the whole chess community, I'll offer my help. Anyway, if you happen to know the programmers of Rybka, Shredder, Fritz etc. - now it's time to ask for a favour! ;)

From the scientific point of view, our main goal was to offer a carefully chosen methodology for using computer chess programs for evaluating the "true" strength of chess players. The methodology will most probably improve in time and actually it's quite possible that similar approaches will be tried for some other games as well. Note that an analysis that would give us the ultimate answer about who was the strongest player ever in all aspects (including psychology) will never be possible, since we don't have much more information from the past games than the played moves and the results obtained (it would be interesting to know at least the times spent for each move not to mention a myriad of other factors).

Anyway - what is the strongest move in a given position? The one that will win against your opponent? Or maybe the one that would provide the best statistical score in thousands of games? One can easily get lost in such questions. However, in the search for the truth, some methods are more objective than others and even if computers can't give us all the ultimate answers, they can surely answer quite a lot, if used appropriately.

Mig | October 30, 2006 5:57 PM

I really don't think the engine is that relevant. It's the same methodology across the board so it should be wrong to the same degree for all players, meaning the stats are still doing what they were designed to do, compare. A stronger engine would be the same degree more accurate across the board. A good thing, but not terribly relevant since all engines are good at the same thing and bad at the same thing when compared to world champions.

I'd be more interested in using the current methodology to test a control group of players. I believe the "strength" hypothesis can be refuted (or strengthened) in this way. Test three or four contemporaries of each champion and see how they come out. Not just players who could credibly be seen as similar in strength to the champions. Just look at the www.chessmetrics.com site for candidates. Test Leko, Gelfand, and Dolmatov, say. If it's an automated process you could just take the top 20 from 1990. If anyone comes out ahead of Kasparov and Karpov we can stop using the word "strength" and call it accuracy or Craftiness...

Charles Milton Ling | October 30, 2006 6:08 PM

Believe it or not, Marc, I *had* forgotten that!
I wasn't implying he should never play it again, I would like to add.
I wonder why he isn't playing 1. Nf3 recently, however. Transpositions galore, of course, but I think the move suits him.
(Before anyone says it for me: Yes, it is really presumptuous for someone rated about 5 classes below Kramnik to comment on his style.)

acirce | October 30, 2006 6:28 PM

It's not relevant to this thread, but yes, Kramnik will play 1.e4 again.

Maurycy | October 30, 2006 6:45 PM

The engine certainly needs to be stronger.

However, nobody seems to have mentioned that the sample of games needs to be MUCH bigger!

Modern players (especially) often choose an extremely limited, but deeply researched opening repertoire for a WCC match. They may play a wider range of positions if tournament play is taken into account.

The total number of games is also small, especially in cases such as Fischer and Kramnik.

I think that high-level tournament games from the period in which a player was WC should be included in order to give a truer verdict on a player's strength.

Andrew Dimond | October 30, 2006 6:48 PM

Thank you to the authors of this study; it was absolutely fascinating.

The results are very interesting, but a few of the results stand out as rather odd. Notably, Spassky was rated as second only to Capablanca for the lowest level of complexity in his games. This seems odd, since if I read correctly, only games played in world championship matches were used for this study? Fischer ranked as third most complex in style, but the only championship match he played was also one of two Spassky played. If they were playing the same positions, how could Fischer's play have been evaluated as so much more complex?

Another odd result on the complexity rankings: Kramnik is rated as having a more complex style than Kasparov. I am not sure that any human expert would share that opinion.

I also find it interesting that the modern champions (Kramnik, Kasparov and Karpov) consistently finish in the top 4 along with Capablanca. I knew Capablanca's play was strangely close to perfect. I did not know it was (according to crafty at least) more accurate than Kasparov, Fischer and Kramnik.

Maurycy | October 30, 2006 6:53 PM

If only people didn't die, or grow old, all this hard work wouldn't be necessary : )

Jeff Sonas | October 30, 2006 6:54 PM

I like Mig's suggestion of taking the top 20 from 1990, and obviously that's a great choice for a time where there was a big gap in strength between #2 and #3. After running some numbers, I would probably say 1988-89 instead because of Ivanchuk, but still... Other times that would be good candidates were Karpov/Korchnoi in 1979-80, and Steinitz/Zukertort in 1884-85. If you want to see the biggest gap between #3 and #4, then go with (obviously) Lasker/Capablanca/Alekhine in 1924 or Lasker/Pillsbury/Tarrasch in 1899-1900 or Kasparov/Anand/Kramnik in 2001.

Paul Massie | October 30, 2006 7:07 PM

Actually, I do think the choice of engine is highly relevant to these tests. Not because of strength, since as Mig says any error should be consistent across all players, but because of style. Different engines, like different people, have varying styles. Granted, all engines have a basically tactical style, but even within that there are large differences. I think to make this scientifically valid the test would need to be repeated with at least 3-4 different engines.

Paul

littlefish | October 30, 2006 7:08 PM

If you check out the old message board thread linked to by Russianbear above you'll see that I did the same kind of analysis for lower-rated contemporary players. It turned out that the average "error" (I'd rather speak of a "disagreement" or "compliance" with the engine) is nicely proportional to Elo rating, so I do think this method can tell us something at least about the tactical quality of games. How close this comes to objective playing strength is a different question, of course.
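The average "error" mentioned here can be sketched as the mean gap between the engine's evaluation of its own best move and its evaluation of the move actually played. This is an illustrative assumption about the calculation, not a reproduction of anyone's actual script; the function name and the input format are hypothetical.

```python
from statistics import mean
from typing import Iterable, Tuple


def average_error(evals: Iterable[Tuple[float, float]]) -> float:
    """Mean disagreement with the engine, in pawns.

    Each item is (best_eval, played_eval) for one position, both from the
    point of view of the player to move. A played move that the engine
    rates at least as highly as its own choice counts as zero error.
    """
    losses = [max(0.0, best - played) for best, played in evals]
    return mean(losses)


# Three positions: engine's move played, a small slip, a one-pawn mistake.
sample = [(0.5, 0.5), (0.25, 0.0), (1.0, 0.0)]
print(round(average_error(sample), 4))
```

Computing this figure for groups of players at different Elo levels is one way to check the proportionality described above.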

Introducing a complexity factor is a very interesting approach, but I'm not sure how accurate it is for determining how players would perform when facing "equally complex" positions. The definition of complexity looks a bit arbitrary, and the correction factors relatively small. Even the corrected error values with compensation for complexity still seem to favour players with a positional style, don't they?

Btw, isn't Toga available as open source and considerably stronger than Crafty? (Not that I'd expect it to make a big difference, this method is mainly about tactical accuracy, and even a relatively weaker engine should give reasonable results on that.)

Dimi | October 30, 2006 7:47 PM

Seeing Bratko's name on the paper brought memories -- he had a Prolog book 20 years ago when I was involved with Logic Programming and AI. Published a couple of papers as an undergrad, working with another Prolog guru.

Anyway, what I'd really like to see is the computer programs participating in regular tournaments. In 1996 I made a $1000 bet that no computer program can win the World Chess Championship if it has to play all of the games -- from qualifications, through elimination, and finally a match. My basic premise at the time was that after a significant exposure to playing with various people, certain patterns, or even bugs will emerge that will allow the humans to play its weak sides and ultimately defeat it.

I don't like these hastily organized matches when they uncover a brand new box and let it run against somebody -- the human is at a disadvantage not having seen this particular program and how it plays. It's like a brand new GM with no history. And in a broader sense, history is what the human brain can process and adapt to.

When Kasparov lost to DB in 1997 the parties demanded the $1000, but to this day I haven't seen a satisfactory exercise where a computer program is seen playing 10 different people over a period of time and really proven the winner. Perhaps, Fritz being so exposed to so many users must be highly debugged by now.

macuga | October 30, 2006 7:49 PM

This study clearly reveals one thing: Capablanca was cheating. Obviously, when his opponents weren't looking, he snuck off to the toilet to consult his Crafty engine.

greg koster | October 30, 2006 7:50 PM

Andrew--

Spassky played three WCC matches.

greg koster | October 30, 2006 7:59 PM

"Lasker, Tal, Alekhine, and to a lesser extent Kasparov, all believed that putting their opponents under pressure was worth at least a pawn and they backed this up consistently."

In their tournament games, yes, but not in their WCC matches. Not much swashbuckling going on in Alekhine-Capablanca. And in many of the K-K games you couldn't tell who's Karpov and who's Kasparov.

zigomar | October 30, 2006 8:01 PM

My nominations for a non-champ. Warning - it’s a personal list.
You will notice that there are ex or future champions appearing in the list; this is so because of my belief that the ex/future champion was, at least some part of the time, playing stronger chess than the current champion.
There is also a disparity in the number of names; champions like Botvinik or Lasker held the title longer than most and it's only natural that there would be more opportunity for other people to ‘beat’ them.

Steinitz – Tchigorin ; Tarrasch ; Lasker
Lasker – Tarrasch ; Maroczy; Rubinstein ; Capablanca
Capablanca – Rubinstein
Alekhine – Capablanca, Keres, Botvinik, Flohr,
Euwe – Alekhine, Keres
Botvinik – Keres, Bronstein, Boleslavski (?), Smislov
Smislov – Keres, Botvinik
Tal – Botvinik, Keres
Petrosijan – Keres, Geller, Spaski
Spaski – maybe Fischer from very late 1970 - early 1971, before that no one
Fischer – Karpov, Korchnoi (?)
Karpov – Korchnoi, Anderson, Ljubojevic
Kasparov – Karpov, Ljubojevic, Anand, Ivanchuk
Kramnik – Anand

Jeff Sonas | October 30, 2006 8:31 PM

I also agree that the sample size needs to be much larger but I think their computer resources required a more limited set of games at first. However, if we can bring to bear the world's chess resources on this problem, then...

Just for fun, a few months back I played around with trying to identify the peak 12-year stretch for top players, using a combination of their rating and their world # rank. Here were the top 40 players all time, along with their peak 12-year periods:

1. Kasparov, Garry (1988-1999)
2. Lasker, Emanuel (1891-1902)
3. Karpov, Anatoly E (1983-1994)
4. Capablanca, José R (1918-1929)
5. Botvinnik, Mikhail M (1938-1949)
6. Alekhine, Alexander A (1924-1935)
7. Anand, Viswanathan (1994-2005)
8. Fischer, Robert J (1961-1972)
9. Kramnik, Vladimir (1993-2004)
10. Smyslov, Vassily V (1949-1960)
11. Korchnoi, Viktor L (1971-1982)
12. Steinitz, Wilhelm (1882-1893)
13. Maróczy, Géza (1899-1910)
14. Ivanchuk, Vassily (1988-1999)
15. Petrosian, Tigran V (1958-1969)
16. Tal, Mikhail (1957-1968)
17. Pillsbury, Harry N (1894-1905)
18. Spassky, Boris V (1960-1971)
19. Tarrasch, Siegbert (1888-1899)
20. Keres, Paul (1953-1964)
21. Bronstein, David I (1947-1958)
22. Najdorf, Miguel (1943-1954)
23. Zukertort, Johannes H (1876-1887)
24. Schlechter, Carl (1900-1911)
25. Polugaevsky, Lev A (1968-1979)
26. Reshevsky, Samuel H (1946-1957)
27. Timman, Jan H (1979-1990)
28. Beliavsky, Alexander G (1980-1991)
29. Euwe, Machgielis (1932-1943)
30. Chigorin, Mikhail I (1893-1904)
31. Bogoljubow, Efim D (1924-1935)
32. Marshall, Frank J (1908-1919)
33. Janowsky, Dawid M (1896-1907)
34. Portisch, Lajos (1977-1988)
35. Blackburne, Joseph H (1880-1891)
36. Nimzowitsch, Aron (1923-1934)
37. Shirov, Alexei (1993-2004)
38. Flohr, Salo (1930-1941)
39. Geller, Efim P (1959-1970)
40. Topalov, Veselin (1994-2005)

I would suggest going as far down the list as possible, taking each player's games during their peak 12 years against top-ten opposition or something like that. It seems like the data would be more balanced that way, as opposed to drawing big conclusions based upon games against just one or two opponents.

Matej Guid | October 30, 2006 9:24 PM

Very interesting comments... I'll try to provide some answers to the posted questions.

Maurycy, you are right saying that a bigger sample of games would probably lead to more reliable results. We chose for the analysis only the matches for the title of "World Chess Champion", where the champions contended for or were defending the title. These matches were always taken very seriously, it was not only about the title - there were also high financial prizes involved and the players usually spent months of preparation for these matches. The expected quality of play is therefore much higher than in ordinary tournaments. That was also the reason why we included ONLY these games. We also didn't have the required time and resources to analyse ALL the games of the selected players.

However, two exceptions were made exactly for this reason. Kramnik and Fischer only had 29 and 20 games (respectively) from the above mentioned matches. Obviously there was a reasonable concern that that would not be enough for reliable results (Kasparov and Karpov, for example, had almost 200 games each). So we decided to add some additional games. Which games were to be chosen for assuring the same or at least similar conditions as were in the world championship matches? Obviously the candidate matches that led the players to the world championship match seemed to be the most sensible choice. For Kramnik, the games from the match with Shirov (1998) were added, and Fischer's candidate matches with Taimanov, Larsen and Petrosian were included. Note that even if Fischer's result in these matches was really impressive (18.5/21), his average score in these matches alone was 0.1183 (still worse than both Capablanca's and Kramnik's score). That might directly confirm our expectations that the results don't actually reflect the quality of play (unless Fischer's play was so incredible in these games that Crafty didn't understand much - we probably won't find out until we try Rybka or any other engine). And if we now replace Kramnik-Shirov with Kramnik-Topalov, not much will change on top either. It was a minor slip in the article not to add a footnote to explain this, somehow it escaped my attention and I think it's appropriate to correct this now.

The additional games probably explain your interesting observations, Andrew Dimond. Obviously the candidate matches influenced Fischer's complexity score and the same could be said for Kramnik-Shirov. Note also that Kasparov played a lot of his games against Karpov and that certainly lowers his complexity score.

However, I'm not saying that the method for complexity measurement could not be improved. Just the opposite, I think that when the computer "changes its mind" at greater depths, this should weigh more than at shallower depths, and my further research confirmed this observation. It's also important to note that the suggested complexity measurement seems to work well only on average, when you take a large sample of positions, and is not so reliable when evaluating individual positions - the calculated standard deviations are just too high. In order to improve the method, probably some factors from cognitive psychology should also be taken into account, although that may be far from easy.
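The depth-weighting idea can be illustrated with a small sketch. It assumes, purely for illustration, that complexity is accumulated each time the engine's preferred move changes from one depth to the next, adding the gap between the best and second-best evaluations at that depth; the function name, the input format and the weighting scheme are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, Tuple


def complexity(
    by_depth: Dict[int, Tuple[str, float, float]],
    weight: Callable[[int], float] = lambda depth: 1.0,
) -> float:
    """Sum the best/second-best gaps at depths where the best move changes.

    by_depth maps search depth -> (best_move, best_eval, second_eval).
    weight(depth) lets later changes of mind count for more.
    """
    total = 0.0
    previous_best = None
    for depth in sorted(by_depth):
        best_move, best_eval, second_eval = by_depth[depth]
        if previous_best is not None and best_move != previous_best:
            total += weight(depth) * abs(best_eval - second_eval)
        previous_best = best_move
    return total


# The engine switches from e4 to d4 at depth 4 and to Nf3 at depth 5.
position = {
    2: ("e4", 0.5, 0.25),
    3: ("e4", 0.5, 0.25),
    4: ("d4", 0.5, 0.25),
    5: ("Nf3", 1.0, 0.5),
}
print(complexity(position))                       # unweighted
print(complexity(position, weight=lambda d: d))   # deeper changes weigh more
```

With a depth-proportional weight, the switch at depth 5 contributes more than the one at depth 4, which is the behaviour argued for above.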

Mig, I like your suggestion and it's certainly worth trying. Actually I performed some tests on various players of different FIDE ratings and while Crafty obviously distinguishes 2750 players from 2350 players, there were quite significant deviations among players of the same rating groups. On the other hand, it's questionable what that proves, since we probably all know very well that a player's form might vary a lot from tournament to tournament, hence all the ups and downs on subsequent rating lists... Anyway, among the 2700+ players that I took into consideration, none surpassed Kasparov or Karpov.

How reliable Crafty was for our analyses was one of the crucial questions since the very beginning. Another question is to what extent the small occasional errors really do cancel out through statistical averaging. As Paul Massie noted, probably several different engines would have to be used for analyses and I share the same opinion. It would also be very interesting to see how much the results of these engines would vary. Although there might be some "Craftiness" reflected in the results, I find it hard to accept that they actually prove that Capablanca was a 2600 player. Kramnik's score was just about the same and I believe he clearly deserves a higher rating, doesn't he? ;) Also do not forget that both players distinctly deviated from the others and that may well be the case with other engines as well...!?

littlefish | October 30, 2006 9:41 PM

I should add, my original analysis was in this thread: http://www.chessninja.com/cgi-bin/ultimatebb.cgi?ubb=get_topic;f=9;t=000728;p=1

FrankM | October 31, 2006 12:41 AM

The approach is of course to estimate a player's probability of selecting the best move (and avoiding blunders). On the whole, "GM-routine" thinking is evaluated, and blunder-checking should mean something. However, the effects of brilliant innovations or stunning moves will be neglected - e.g. some of the first games of Kramnik-Topalov had moves that Rybka didn't figure out with moderate think time. This is probably what distinguishes a super-GM from a normative GM. Editing/Human correction of the computer evaluations for unusual great moves would improve this kind of study, and the extensive published analysis of WC games would make it possible. Note that the effects of adjournment rules, and very long think time in Golden Age games should also be accounted for : at the very least, it means that quality of play is not necessarily a valid comparison for quality of players (like comparing active chess vs. classical).

There are also questions of style, and complexity only scratches the surface. Lasker had a reputation for playing inferior moves for psychological reasons, and Topalov has recently played plenty of moves that complicated positions, in some cases with success. Here conditional probabilities depend not only on game theory, but on "gaming" the opponent - and it matters who the opponent might be.

Peter Ballard | October 31, 2006 12:42 AM

As I've said elsewhere: if only world championship matches are taken into account, why did Capablanca rate so well? Did he "outrate" Alekhine when he lost to him?

greg koster | October 31, 2006 6:06 AM

Capablanca also played a WCC match with Lasker.

Peter | October 31, 2006 8:47 AM

Bobby Fischer did a new radio interview on an Icelandic radio station. You can find the whole interview, but only the English parts, on www.deep-chess.de

Bill M | October 31, 2006 9:55 AM

"why did Capablanca rate so well? Did he "outrate" Alekhine when he lost to him?"

Alekhine played other championship matches where his playing level was probably lower.

David Wagle | October 31, 2006 10:01 AM

Mig says:

"It's actually quite a bit more interesting than I thought it was going to be at the start, although there are a few key failings. Not of methodology, just the limitations of computers and statistics analyzing a human game."

I find it interesting how many people fail to recognize that a game of chess is merely a sequence of steps in a finite state table. Or at least fail to precisely choose their language when talking about it. It's a really really big state table, but still a state table.

The only reason chess hasn't been "solved" is that the state table is so large. However, that doesn't make any aspect of the game random. Because of limitations of both people and computers right now, our perception of the game is that it is one where there is an element of "chance." But there isn't.

Kasparov can't make moves appear on the board that are impossible for a lesser player to play. What he can do is find patterns, steps in the state tree, that escape the notice of lesser players. But the potential for those moves from the same position is equal for all players.

What makes the use of ANY engine in such analysis viable is that the engine will use precisely the same criteria at all times to evaluate the state tree.

This means that we can compare apples-to-apples across eras and across player styles with regard to the standard of how this particular engine evaluates the state of the game.

Now, again due to limitations of current computer systems, it is impossible to cover the entire tree, and so the answer we get back from this kind of analysis is itself only one data point. However, it is relevant to the question being asked.

This methodology, perhaps with some modifications, would answer the question completely and fully if and when we have a computer that can solve the state tree. While that may never happen, it is not beyond the realm of possibility.

Although to get there, we need some major advances in both technology and mathematics -- perhaps most important would be the ability to significantly compress the information about each state or collection of states to a manageable size.
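
The "state table" idea is easier to see on a game small enough to solve outright. The sketch below is not chess: it assumes a toy subtraction game (players alternately take 1 to 3 stones, and whoever takes the last stone wins) and a hypothetical function name, purely to show what exhaustively solving a finite game looks like.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def player_to_move_wins(stones: int) -> bool:
    """True if the side to move can force a win in the toy subtraction game.

    Hypothetical helper for illustration only. With 0 stones left the side
    to move has no legal move and has already lost.
    """
    # A position is winning if at least one legal move leaves the opponent
    # in a losing position; any() over no legal moves is False.
    return any(
        not player_to_move_wins(stones - take)
        for take in (1, 2, 3)
        if take <= stones
    )


# Every position gets one fixed verdict; nothing about it is random.
losing = [n for n in range(13) if not player_to_move_wins(n)]
print(losing)
```

Chess differs only in scale: the same recursion applies in principle, but the table is far too large to fill in with current hardware, which is the limitation described above.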

Mig | October 31, 2006 10:19 AM

In other words, the limitations of computers and statistics. Sounds familiar.

I don't think there is anyone who visits this site who doesn't know the endless clever stats about how many possible positions, how many possible games, etc. But this study didn't pretend omniscience. Secondly, even if we had a computer that solved chess and ran these same calculations to absolute perfection on every move, it would be irrelevant to my point, which was that in human chess preference matters because results matter. If you gave a GM the task of playing a game that would most please a computer that's quite different from asking him to beat his opponent *even if pleasing the computer means objective perfection.*

That is where the apples to apples comparison breaks down. Humans don't play for accuracy, they play to win. These goals coincide most of the time, maybe even 90+% of the time. But in that other 10% we have the difference between Karpov and Tal. Both brilliant tacticians but with entirely different needs and resources when it comes to winning a chess game. Show the same position to Karpov, Tal, and Crafty (Rybka, whatever) and you may get two or three different answers for best move. They will probably all be correct even if Crafty considers Tal's move a sixth of a pawn inferior to Karpov's.

The god computer we envision might be able to say conclusively that one of the moves is a forced mate in 3,403 (a low estimate) and the other move a forced loss in 2,338. Irrelevant to the matter of human strength. Barring serious blunders, Tal's move will give Tal the best chance to outplay his opponent and he was well aware of this. Crafty is not. Even the god machine is not.

Yuriy Kleyner | October 31, 2006 10:25 AM

David,

As has been pointed out by numerous people, the criteria are constant only in the mind of the computer. It's true that it's possible to play the most technically sound move, but that is only the best move if playing a computer who is affected only by the objective strength of a move, not the amount of time and energy it takes to analyze it, the amount of home preparation spent on the move, how well researched an opening is, the player's particular comfort in the on-board situation created by the move, etc. It is possible to create a "perfect" chess game, though again you run into issues like playing for a draw--which I would guess to be the most likely result of a game played without any mistakes. The question of "what is the best move to make in a situation" is not, however, an objective one, based purely on the technical strength of a move. There is hardly an algorithm for the chance of a particular player making a mistake in each continuation or for how comfortable/prepared he might be in a certain opening line.

Russianbear | October 31, 2006 10:56 AM

Matej Guid:

So what did you think of my idea of using Rybka for analysis and using Crafty to calculate complexity factors?

Mig: I disagree. I think the engine makes all the difference. Rybka is MUCH more humanlike. Of course, it is still a tactical monster, but it does many more things well than Crafty does, which is why it outrates Crafty by 300+ points. Also, Rybka looks at a very small number of positions per second compared to other top machines or even Crafty, but its evaluation function is what separates it from the rest. Rybka is just more human in its evaluation. Crafty uses piece/board tables to make moves, so it thinks a rook is best on the 7th rank no matter what (well, unless it can win material elsewhere), but Rybka actually checks how useful the rook would be on the 7th rank: is there anything to attack there, will the rook be pinning something, will it be able to get back on defence if it needs to.

Read http://mysite.verizon.net/vzesz4a6/id23.html

Rybka is much more human-like in the way it evaluates positions. Its evaluation may not be as good as that of Kramnik, but then again, it does look at hundreds more positions than Kramnik, so that kinda makes up for whatever it may misevaluate. Yeah, it is still essentially a computer, but this computer does many more things much better than your Fritz or your Crafty. That's the reason I say Rybka should be used for examining the world champions' games.

Think of it in reverse: suppose a human needs to examine the games of 1997 Deep Blue, 2003 Deep Fritz, 2005 Hydra, and current versions of Fritz and Rybka and tell which engine is the strongest. Let's assume you have a sample of their games that has roughly equal results for each engine, and that you have no way of making them play one another, so you can only infer the quality of their play from their moves. Which human is more likely to tell us which engine is the strongest - a guy rated 2500 or a world champion who is rated above 2815? I mean, a human is still human and a 2500-rated person can still offer some nice insight into which programs have better positional understanding, for example. But it is possible that the best programs surpass the 2500-rated human in all aspects of the game, including positional understanding (kinda like some world champs could be better than Crafty at EVERYTHING), so the 2500 would totally misjudge those engines. It is kinda obvious that the 2800+ player would do a much better job at comparing the strength of engines - simply because he is a much better chess player. I don't mean to sound like Topalov, but 300+ rating points is a totally different level altogether, and Rybka is higher than Crafty by 300+ points! Being that much stronger allows it to be accurate in MANY aspects - not only in things that computers are traditionally strong at, but also at other things.

There is also another reason why I don't think the fact that Rybka is a computer would make it somehow biased towards certain champions. Rybka is simply very strong, and I think the strongest machines and strongest people kinda converge in their chess understanding and chess styles, even though they use very different approaches. For example, that Be4 move in game 2 of the Deep Blue match - it was so human-like it psyched Kasparov out. Or the opposite example: I was examining game 22 of the Alekhine-Capablanca match yesterday, and Rybka doesn't find the Bxe6 sacrifice that Alekhine made, simply because it is too deep and complex. So even though the position was tactical (with queens off the board, no less), even in tactical positions humans could (can) outcalculate the best computer programs of the present day.

Yuriy Kleyner | October 31, 2006 10:58 AM

I have always wondered if a series of best moves, according to a perfect analysis tool, would generate a scenario which would guarantee a win. And how does one objectively choose between two moves for a "perfect player," if both lead to a win anyway.

Yuriy Kleyner | October 31, 2006 11:01 AM

Russkiy Misha,

Interestingly though some of the best analysts in history were never world champions and some world champions never got too heavily into chess analysis. So pure chess skill is probably not the best indication of analytical ability.

Mig | October 31, 2006 11:10 AM

It's still a computer, RB. 95% of its strength comes from calculation and tactics. If that drops to 93% with superior knowledge that's helpful, but I don't think it will make much of a difference. An engine with more knowledge like Rybka or HIARCS would be more accurate than Crafty because of the fixed search depth methodology; I'm just not sure it would have much of an impact.

Note that those 300 points come from engine-engine play, not against humans. Computers kill each other in an entirely different way but they all kill humans in the same way and evaluate human play in an almost identical way: tactical mistakes. If you dropped Rybka and Crafty into the Ordix Open there is little doubt in my mind they would score around the same number of points. In Dortmund maybe there the difference is felt a little more. But when it comes to evaluating a world champ's moves in a few seconds I'm skeptical it makes much difference as long as the engine isn't simply feeble, and Crafty isn't.

Russianbear | October 31, 2006 11:11 AM

Yuriy, it is irrelevant to the point I was trying to make. My point was that a human rated 2800 has better chess understanding than a 2500, and therefore would be more qualified to pass judgement on which program is stronger. Whether they would be doing such analysis in a time span comparable to a tournament game or over a much longer time, kinda like correspondence chess, is irrelevant. A correspondence player rated 2800 would do a better job than a correspondence player rated 2500 if the format of the (hypothetical) analysis were close to a correspondence game. And Kasparov would do a better job than a 2500 OTB player if they were to analyse at the board for a few hours.

Russianbear | October 31, 2006 11:26 AM

Mig, I agree that all computers kill humans in a similar way (even though there probably are exceptions like game 2 of the 1997 Deep Blue match). But my point is not that Rybka would be more accurate in assessing the blunders that world champs made - for that we don't need anything much better than Crafty, you are right. I do think, however, that Rybka would be able to appreciate some of the more subtle positional plans/moves much more than Crafty would - and THAT is where its huge rating and more human approach would come in handy.

Another thing I'm not sure I agree with is the assumption that, just because Rybka keeps killing other engines, it must be doing so tactically. Rybka consciously limits the number of positions it looks at because it hopes to look at the more critical ones. My point here is that I see a 300-point rating difference and I cannot assume that all that superiority is just due to better tactics. Also, in

http://www.chessninja.com/cgi-bin/ultimatebb.cgi?ubb=get_topic;f=8;t=000238;p=1#000013

Permanent brain says he found that Rybka is 5% more likely than Fritz to play a move human world champions played. And that is Fritz, not Crafty, and I imagine Fritz is more humanlike than Crafty, because it is much stronger.

And of course, I agree with you that at a few seconds per move it won't make a difference if one uses Crafty or Rybka or whatever. I think a few seconds per move is not deep enough to judge with any sort of accuracy. You would only catch the biggest blunders that way, and that is what this study seems to have done. But I think if we ran Rybka at, say, 5 minutes per move, we would get much more interesting/meaningful results.

Yuriy Kleyner | October 31, 2006 11:33 AM

RB,

No, I disagree. Some players may have great analytical ability but function poorly in a competitive setting and not just because of time control.

Mig,

I think even more interesting than Karpov's best move and Tal's best move is best move against Karpov and against Tal.

Russianbear | October 31, 2006 11:41 AM

Yuriy, it may be true, but it is irrelevant to the point I was making. My point was (again!) that 2800 is a better player than a 2500, so he is to be trusted more when he says something about complex chess games.

Geez, I am not arguing anything special, just that a 2800 understands chess better than a 2500. That is why people buy books on world champions written by Kasparov and not some 2500, and that is why I want to see Rybka analyze the games for a study like the one we are talking about.

Matej Guid | October 31, 2006 12:02 PM

In the Buenos Aires (1927, 34 games) match, according to the analysis, Capablanca actually did outrate Alekhine. Both players did exceptionally well in this match and, as you correctly guessed, Bill M, the results show that Alekhine's playing level was indeed lower in the other matches he played afterwards. How is it possible to play better and still lose a match? Well, note that in the games that he (or she) lost, the player may commit more and bigger mistakes, and that may certainly have a large impact on the final score. We all probably know only too well how an exceptionally well played game can be spoiled by just a single move. Although the results do matter, they don't always reflect the quality of play and vice versa.

Note also that moves where both the move made and the move suggested had an evaluation outside the interval [-2, 2] were discarded and not taken into account in the calculations. The reason for this is that a player with a decisive advantage often chooses not to play the best move, but rather plays a move which is still 'good enough' to lead to victory and is less risky. A similar situation arises when a player considers his position to be lost - a deliberate, objectively worse move may be made in such a case to give the player a higher practical chance to save the game against a fallible opponent. This was another mechanism designed to aim at the most objective results possible. I address this point here to deal with possible fears that any blunders in lost positions (intentional or not) spoiled the results in any way...
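
As a rough sketch of the filtering rule just described, the following Python computes an average error per move while discarding moves where both evaluations lie outside [-2, 2]. The `MoveEval` type, the `mean_error` name, the sample scores, and the definition of error as "best score minus played score" are all assumptions made for illustration; the paper's actual implementation may differ.

```python
from typing import Iterable, NamedTuple, Optional


class MoveEval(NamedTuple):
    best: float    # engine score (in pawns) of the engine's preferred move
    played: float  # engine score (in pawns) of the move actually played


def mean_error(moves: Iterable[MoveEval], bound: float = 2.0) -> Optional[float]:
    """Average score lost per move, skipping already-decided positions."""
    losses = []
    for m in moves:
        # Discard only when BOTH scores fall outside [-bound, bound].
        if abs(m.best) > bound and abs(m.played) > bound:
            continue
        # Assumption: the loss is never negative, so clamp at zero.
        losses.append(max(0.0, m.best - m.played))
    return sum(losses) / len(losses) if losses else None


# Invented sample: the third move is discarded (both scores above +2),
# the fourth is kept because the played move falls back inside the interval.
sample = [
    MoveEval(0.30, 0.10),
    MoveEval(0.50, 0.50),
    MoveEval(3.40, 2.60),
    MoveEval(2.50, 1.00),
]
print(mean_error(sample))
```

Under this reading, a "good enough" move in a won position costs nothing, while a move that throws away a decisive advantage still counts.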

RussianBear, I think I already answered to your question about combining Rybka and Crafty. However, there's more. We designed a mathematical model, and will publish it soon, that, possibly surprisingly, shows:
1) To obtain a sensible ranking of players, it is not necessary to use a computer that is stronger than the players themselves. There are good chances to obtain a sensible ranking even using a computer that is weaker than the players.
2) The (fallible) computer will not exhibit preference for players of similar strength to the computer.

Some readers speculated that the program will give a better ranking to players that have a similar rating to the program itself, and we intend to demonstrate that this is not the case. At the same time we would like to support our claims that using a different engine wouldn't affect the results significantly, although I'm also eager to see the confirmation (or refutation) with real experimental data.

Jean-Michel | October 31, 2006 12:06 PM

I thought the article was very interesting, but not really on the aspect everyone is debating.

When you analyse the games of these top players by comparing how well their moves match to those of a computer program, it doesn't necessarily mean that these players are stronger than other players.

But it certainly gives you a good objective indication of which players' style in WCC play has been most computer-like. I think that is interesting in and of itself.

Capablanca being ahead of the modern guys, in particular, is a huge surprise, I would imagine. I think this bit of data says something very important on the nature of chess, perhaps, on what has changed and evolved in the way we play, and what hasn't.

I don't have sufficient chess knowledge to be able to analyse what it could be, but it seems worth having someone who knows more about chess than I do ask some questions.

The sad thing about a study like this is it gets published because it's claiming to examine "who is greater than whom", which everybody loves to argue about... but it cannot hope to do this. But what it actually does say, which for me is very interesting, gets pushed to the side by all the K. vs K. arguments.

Anyway, have fun with it all, it's all to be enjoyed any way you can.

Yuriy Kleyner | October 31, 2006 12:22 PM

RB,

I disagree. First of all, Kasparov's books sell because he is Kasparov--not because the prospective buyers evaluated the quality of the analysis inside. Second, understanding of chess is not directly correlated to one's chess rating. Some of the best analysis out there actually belongs to midlevel GMs and those are some of the best books out there that knowledgeable readers buy.

2800 is a better player than 2500. But he is not necessarily the best analyst.

Rooks | October 31, 2006 12:25 PM

Interesting stuff, good luck with your PhD Matej!

Albert Silver | October 31, 2006 1:17 PM

I have a lot of issues with the study's methodology and results. To begin with, there is the selection of games used.

Only games from World Championship matches were analyzed. There are two problems because of this:

1) The number of games per champion will vary wildly. The one with the most games/moves will clearly be Kasparov with over 10 times more games than Fischer, who played only one match.

2) Play in world championship matches is very much against the player. I'm not speaking of antics like the recent infamous toilet episodes, but things like Kramnik's specific match strategy against Kasparov, simplifying the positions all the time, to take Kasparov out of his middlegame comfort zone. Basically trying to kill Kasparov's edge. This is fine, but the authors then produce a chart concluding this is indicative of Kramnik's play period. I really don't agree. I'm not saying he doesn't like playing simplified positions, but that this particular aspect was exacerbated to fit his match strategy.

Then there is the issue of who is judging. It's true that even if an open source program such as Toga were used, that is roughly 250 Elo points stronger than Crafty, I'd still have issues, but the stats presented basically imply that perfect play means playing Crafty's choices made after some 30 seconds of analysis. Anything that is not chosen by Crafty is worse.

The charts showing the average error rate of Capablanca at 0.1008 per move, compared to Kasparov's 0.1292, not only state that Capablanca made fewer mistakes, but that Crafty should earn/increase an advantage over these players by that much per move.

In other words, if Crafty playing at 30 seconds per move (it's a rough time estimate based on their depth used) were to play a match against Kasparov playing at an average of 3 minutes a move, Crafty's position should improve by +1 (1 pawn) every 8 moves or so.

In fact, the chart basically states that Crafty's *minimum* edge over the World Champions, at the peak of their careers, is 1 pawn per 10 moves against Capablanca, and 1 pawn every 6-7 moves for the rest, excluding Steinitz.

I cannot tell you how absurd I think that is.
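
The arithmetic behind those figures is simply the reciprocal of the average error, read as if the error accumulated linearly and the engine cashed in every bit of it. That reading is the one being criticized here, not something the study itself claims; the function name below is hypothetical.

```python
def moves_per_pawn(avg_error_per_move: float) -> float:
    """Moves needed to accumulate one pawn of error at a constant rate.

    Hypothetical helper: it assumes the per-move error simply adds up,
    which is exactly the literal reading under dispute.
    """
    return 1.0 / avg_error_per_move


# Average error rates quoted from the study's chart.
for name, err in [("Capablanca", 0.1008), ("Kasparov", 0.1292)]:
    print(f"{name}: about {moves_per_pawn(err):.1f} moves per pawn")
# Capablanca: about 9.9 moves per pawn
# Kasparov: about 7.7 moves per pawn
```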

One small addendum to the authors who state that Kasparov's work on WCs is merely one of GM analysis and GM opinion. This is incorrect. IIRC, Kasparov stated that he had 3 computers running 24 hours analyzing and checking analysis. MIG could no doubt shed more light on this.

Albert Silver

Dimi | October 31, 2006 2:44 PM

Are the heuristics of position evaluation pretty standardized between the various programs these days, or are there vast differences?

If the latter, then I can't see how the analyses can agree. In that case computers would not be any more objective than their programmers.

Per Mig's comments -- indeed, people drive towards positions they play best. Therefore the objectivity of the best move is very subjective.

The only objective evaluation from a computer I'd totally trust is the exhaustive computation of the quickest victory path within a certain position.

Jakob B. | October 31, 2006 3:03 PM

Another suggestion for the control group:
An elite player who is the incarnation of the "put the opponent under pressure"-type is Larsen.
His philosophy was not to look for "correct moves" but for moves that made the life of his opponent hard.
It could be interesting how this type of analysis would rate him.

tjallen | October 31, 2006 4:03 PM

Congrats to the study's authors for their work, which I thought was quite interesting.

What comparison is there between a player who plays a series of moves of standard GM strength, vs. a player who alternates very strong moves with slightly weaker than average moves? (Their average move strength is equal, but one plays more strong and weak moves.) Would the study note this difference?

Similarly, what of the player who puts one blunder in each of four games, vs. the player who plays three perfect games followed by one game with four blunders? (both average one blunder per game...)
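
A minimal sketch of why the average alone cannot answer the questions above: two made-up error sequences with the same mean but different spread. All numbers are invented for illustration and the `summarize` helper is hypothetical; reporting a spread measure next to the mean is one possible way to expose the difference, not necessarily what the study reports.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def summarize(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of a sequence."""
    return mean(values), pstdev(values)


# Per-move error in pawns: a steady player vs. one alternating strong and weak moves.
steady = [0.10] * 8
streaky = [0.00, 0.20] * 4

# Blunders per game: one in each of four games vs. all four in a single game.
spread_out = [1, 1, 1, 1]
one_bad_day = [0, 0, 0, 4]

for label, data in [
    ("steady", steady),
    ("streaky", streaky),
    ("spread_out", spread_out),
    ("one_bad_day", one_bad_day),
]:
    avg, sd = summarize(data)
    print(f"{label}: mean={avg:.3f} spread={sd:.3f}")
```

Each pair shares a mean, so a ranking by average error treats them as identical; only the spread column tells them apart.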

Most of the world championship games happened in the era of adjournments, so: What about the increase in errors near time controls? What about play just after adjournment, which should be near-perfect, if the players have been reviewing all night?

Ah, statistics!
tjallen

Cynical Gripe | October 31, 2006 7:13 PM

------------------------------------------------
I really don't think the engine is that relevant. It's the same methodology across the board so it should be wrong to the same degree for all players, meaning the stats are still doing what they were designed to do, compare. A stronger engine would be the same degree more accurate across the board.
------------------------------------------------

I disagree strongly with this claim, as it is based on the assumption that the engine provides an unbiased estimate of the move quality. In other words, statistics only tell the tale if the sample mean (of the evaluation) is equal to the true mean, a very dubious assumption considering the players being evaluated are stronger than the engine.

---------------------------------------------
I disagree. I think the engine makes all the difference.
---------------------------------------------

---------------------------------------------
The trouble of course is that if Crafty is truly only around 2620 strength, its evaluation errors are not "small and occasional" but frequent and sometimes rather severe.
----------------------------------------------

Agreed.

Cynical Gripe | October 31, 2006 8:02 PM

I should actually have said asymptotically unbiased (bias goes to zero as the number of samples tends to Inf). BTW, if you think I'm full of it, consider the extreme case of letting a 1200 rated engine be your benchmark. Would you still be confident in the results assuming a large enough sample population? To know how strong a given engine must be to produce meaningful results in this experiment would require knowledge of its evaluation error statistics relative to the discrepancies in results for the top players - probably not easy to come by since 'perfect' evaluations don't exist.
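
To make the bias point concrete, here is a small simulation under stated assumptions: every number in it is invented, `measured_mean` is a hypothetical helper, and the engine is modelled as "true error plus a fixed bias plus zero-mean noise". It shows only that a larger sample averages away the noise, not the bias.

```python
import random


def measured_mean(true_error: float, bias: float, noise_sd: float,
                  n: int, seed: int = 0) -> float:
    """Average of n engine measurements of one player's per-move error.

    Hypothetical model: measurement = true error + systematic bias + noise.
    """
    rng = random.Random(seed)
    total = sum(true_error + bias + rng.gauss(0.0, noise_sd) for _ in range(n))
    return total / n


N = 100_000
# Invented figures: player A truly errs less (0.10 vs 0.12 pawns per move),
# but the engine systematically misjudges A's style by +0.04.
a_measured = measured_mean(0.10, bias=0.04, noise_sd=0.3, n=N, seed=1)
b_measured = measured_mean(0.12, bias=0.00, noise_sd=0.3, n=N, seed=2)

print(f"A measured: {a_measured:.3f} (true 0.100)")
print(f"B measured: {b_measured:.3f} (true 0.120)")
print("Ranking flipped:", a_measured > b_measured)
```

If the bias were the same for every player it would cancel in a comparison; the trouble arises only when it depends on the player's style, which is the open question in this thread.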

Mangafranga | October 31, 2006 9:44 PM

Was an opening book used?

Russianbear | October 31, 2006 10:49 PM

Matej Guid, I replied to you in http://www.chessninja.com/cgi-bin/ultimatebb.cgi?ubb=get_topic;f=8;t=000238;p=1#000018

I think it is easier to have this conversation in a forum instead of a blog entry comment. Please let me know what you think. Thanks.

Matej Guid | November 1, 2006 8:09 AM

Hi RussianBear, I can see that the discussion on that forum took a very interesting course. I'll join when I have some spare time, hopefully tomorrow..

Krish.Adam | November 1, 2006 1:46 PM

There is no 'unbiased' way of analyzing GM games; even computers are biased, the difference is that machines aren't aware of it. That's all!

Also, one should doubt some of the former World Championships and the current ones being held under FIDE. The former Soviet Union regime is known for its atrocities and the current FIDE president is infamous for that. Therefore, the best approach would be to include all or most of the games played between the top 10 players of different eras, including all the major 'grand slams' and not just the World Championships, to get a real picture of who the best player was. That's why I like Jeff Sonas's approach of the best 12-yr stretch. You can see players like Anand rated better than today's giants Kramnik and Topalov.

Another factor to be considered is the computer-assisted or seconds-assisted cheating that might have been going on for decades now. To avoid this, there should be separate statistics on the top 10 fastest players of all time. Rapid tournaments and Rapid World Championships may truly live up to the results for the simple reason that there is no time for any 'external' influence or for sitting in the toilet for meditation! Results would then be reliable and should be fed into the classical games to see if there is a correlation. After all, classical games are also time-bound!

-Krish.Adam

