> Rear window decal clearly reads “www.taxilinder.at”. A quick lookup shows Taxi Linder GmbH is based in Dornbirn, Vorarlberg.
That's cheating. If it can use web search, it isn't playing fair. Obviously you can get a perfect score on any urban GeoGuessr round by looking up a couple businesses, but that isn't the point.
SamPatt 5 hours ago [-]
Author here - it's a fair criticism, and I point it out in the article. However, I kept it in for a few reasons.
I'm trying to show the model's full capabilities for image location generally, not just playing geoguessr specifically. The ability to combine web search with image recognition, iteratively, is powerful.
Also, the web search was only meaningful in the Austria round. It did use it in the Ireland round too, but as you can see by the search terms it used, it already knew the road solely from image recognition.
It beat me in the Colombia round without search at all.
It's worthwhile to do a proper apples-to-apples comparison - I'll run it again and update the post. But the point was to show how incredibly capable the model is generally, and the lack of search won't change that. Just read the chain of thought - it's incredible!
LeifCarrotson 4 hours ago [-]
There's some level at which an AI 'player' goes from being competitive with a human player, matching better-trained human strategy against a more impressive memory, to just a cheaty computer with too much memorization. Finding that limit is the interesting thing about this analysis, IMO!
It's not interesting playing chess against Stockfish 17, even for high-level GMs. It's alien and just crushes every human. Writing down an analysis to 20-move depth, following some lines to 30 or more, would be cheating for humans: it would take way too long (exceeding any time controls and, more importantly, the lifetime of the human), while a powerful computer can just crunch it in seconds. Referencing a tablebase of endgames for 7 pieces would also be cheating; memorizing 7 terabytes of bitwise layouts is absurd, but the computer just stores that on its hard drive.
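For a sense of how mechanical that lookup is, here's a minimal sketch using the python-chess library - assuming Syzygy table files have already been downloaded into ./syzygy (the path and the position are illustrative):

    import chess
    import chess.syzygy

    # King and pawn vs. king: 3 pieces, well within the tables
    board = chess.Board("8/8/8/4k3/8/8/4P3/4K3 w - - 0 1")

    with chess.syzygy.open_tablebase("./syzygy") as tablebase:
        # Win/draw/loss from the side to move: 2 = win, 0 = draw, -2 = loss
        print(tablebase.probe_wdl(board))
        # Distance to the next zeroing move (capture or pawn move) on the winning path
        print(tablebase.probe_dtz(board))

No search, no evaluation - just a perfect answer retrieved from disk.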
Human geoguessr players have impressive memories way above baseline with respect to regional infrastructure, geography, trees, road signs, written language, and other details. Likewise, human Jeopardy players know an awful lot of trivia. Once you get to something like Scrabble or chess, it's less and less about knowing words or knowing moves, but more about synthesizing that knowledge intelligently.
One would expect a human to recognize some domain names like, I don't know, osu.edu: lots of people know that's Ohio State University, one of the biggest schools in the US, located in Columbus, Ohio. They don't have to cheat and go to an external resource. One would expect a human (a top human player, at least) to know that taxilinder.at is based in Austria. One would never expect any human to have every business or domain name memorized.
With modern AI models trained on internet data, searching the internet is not that different from querying its own training data.
vlovich123 1 hour ago [-]
> It's not interesting playing chess against Magnus, even for high-level GMs. He just crushes almost every human
The differences even among humans between the absolute best & those out of the top 10 tend to be pretty drastic. And a non-IM against Magnus won't even understand what's going on. You could similarly claim that Magnus just memorized a bunch of openings, which is a criticism leveled at GMs anyway, and why Chess960 is now gaining more traction. My point is that there's not really such a thing as "fair" in a competition.
Re geoguessr, why not let them use whatever tools are available? I have similar critiques about bike racing & the restrictions on which technological advancements can be put on the bike. But every competition chooses arbitrary lines to draw, which compose the rules, so it doesn't really matter.
sensanaty 37 minutes ago [-]
I mean Geoguessr explicitly states when you launch the game (in PvP mode) that googling/searching is bannable
mrlongroots 3 hours ago [-]
To reframe your takeaway: you want to benchmark the "system" and see how capable it is. The boundaries of the system are somewhat arbitrary: is it "AI + web" or "only AI", and it is not about fairness as much as about "what do you, the evaluator, want to know".
tshaddox 2 hours ago [-]
> There's some level at which an AI 'player' goes from being competitive with a human player, matching better-trained human strategy against a more impressive memory, to just a cheaty computer with too much memorization. Finding that limit is the interesting thing about this analysis, IMO!
And a lot of human competitions aren't designed in such a way that the competition even makes sense with "AI." A lot of video games make this pretty obvious. It's relatively simple to build an aimbot in a first-person shooter that can outperform the most skilled humans. Even in ostensibly strategic games like Starcraft, bots can micro in ways that are blatantly impossible for humans and which don't really feel like an impressive display of Starcraft skill.
Another great example was IBM Watson playing Jeopardy! back in 2011. We were supposed to be impressed with Watson's natural language capabilities, but if you know anything about high-level Jeopardy! then you know that all you were really seeing is that robots have better reflexes than humans, which is hardly impressive.
rowanG077 2 hours ago [-]
You seem to indicate you want a computer to beat a human without ever using what a computer is actually good at (large memory, brute-force compute, etc). That seems a little ridiculous to me. How do you want it to engage? Disallow use of native compute and require it to simulate a full human brain?
Sure, I do agree that the web search is too far, because it's literally cheating. But Stockfish is superhuman at chess, and it doesn't really matter that it achieves this by leveraging the strengths of a computer.
monadINtop 5 minutes ago [-]
I disagree. If we're gonna be hyping up machines for their prowess at "thinking" and being artificially "intelligent" in that soft effusive human way, then yeah, I think it's fair criticism. We already knew from the 50s that computers are like stupid geniuses when it comes to following algorithms and crunching computations far too expansive and tedious for any human.
k4rli 4 hours ago [-]
It's still as much cheating as googling. Completely irrelevant. Even if it were to beat Blinky, it's no different from googlers/scripters.
SamPatt 2 hours ago [-]
I disagree. I ran those rounds again, without search this time, and the results were nearly identical:
I tried the image without search and it talked about Dornbirn anyway, but ended up choosing Bezau, which is really quite close.
edit - the models are also at a disadvantage in a way: they don't have a map to look at while they pick the location.
SamPatt 2 hours ago [-]
Yes, I re-ran those rounds and it made the same guesses without search, within 1km I believe.
You're right about not having a map - I cannot imagine trying to line up the Ireland coast round without referencing the map.
Ukv 6 hours ago [-]
The author did specifically point out that
> Using Google during rounds is technically cheating - I’m unsure about visiting domains you find during the rounds though. It certainly violates the spirit of the game, but it also shows the models are smart enough to use whatever information they can to win.
and had noted in the methodology that
> Browsing/tools — o3 had normal web access enabled.
Still an interesting result - maybe more accurate to say O3+Search beats a human, but could also consider the search index/cache to just be a part of the system being tested.
godelski 5 hours ago [-]
Pointing out that it is cheating doesn't excuse the lie in the headline. That just makes it bait and switch, a form of fraud. OP knew they were doing a bait and switch.
I remember when we were all pissed about clickbait headlines because they were deceptive. Did we just stop caring?
sdenton4 4 hours ago [-]
The people pissed about clickbait headlines were often overstating things to drum up outrage and accumulate more hacker news upboats...
godelski 2 hours ago [-]
I'm not sure why you're defending clickbait. It is just fraud. I'm not sure why we pretend it is anything different.
Sure, people made overblown claims about the effects, but that doesn't justify fraud. A little fraud is less bad than major fraud, but that doesn't mean it isn't bad.
627467 4 hours ago [-]
Cheating implies there's a game. There isn't.
> Titles and headlines grab attention, summarize content, and entice readers to engage with the material
I'm sorry you felt defrauded instead. To me, the title was very good at conveying the ability of o3 in geolocating photos.
hatthew 3 hours ago [-]
Title says o3 beat a [human] player. That implies there is some competition that has the capacity to be fair or unfair.
tshaddox 2 hours ago [-]
Sure, but o3 is itself already an online service backed by an enormous data set, so regardless of whether it also searched the web, it's clearly not literally "playing fair" against a human.
godelski 2 hours ago [-]
But it still bounds the competition. OP is skilled in the domain. I'm not, so if I wrote a post about how O3 beat me, you'd be saying how mundane a result it is. I mean, I suck at Geoguessr. Beating me isn't impressive. This is also a bound.
sebzim4500 3 hours ago [-]
Presumably being an AI is technically against the GeoGuessr rules so I don't see how there can be an expectation that those rules were followed.
SecretDreams 4 hours ago [-]
What's your suggestion for an alternative headline?
godelski 2 hours ago [-]
Can O3 Beat a Master-Level GeoGuessr Player?
How Good is O3 at GeoGuessr?
EXIF Does Not Explain O3's GeoGuessr Performance
O3 Plays GeoGuessr (EXIF Removed)
But honestly, OP had the foresight to remove EXIF data and memory from O3 to reduce contamination. The goal of the blog post was to show that O3 wasn't cheating. So by including search, they undermine the whole point of the post.
The problem really stems from the lack of foresight - a misunderstanding of the critiques they sought to address in the first place. A good engineer understands that when their users/customers/<whatever> make a critique, what the gripe is about may not be properly expressed. You have to interpret your users' complaints. Here, the complaint was "cheating", not "EXIF" per se. The EXIF complaints were just a guess at the mechanism by which it was cheating. But the complaint was still about cheating.
SamPatt 1 hour ago [-]
>The goal of the blog post was to show that O3 wasn't cheating.
No, the goal of the post was to show that o3 has incredible geolocation abilities. It's through the lens of a Geoguessr player who has experience doing geolocation, and my perspective on whether the chain of thought is genuine or nonsense.
In Simon's original post, people were claiming that o3 doesn't have those capabilities, and we were fooled by a chain of thought that was just rationalizing the EXIF data. It only had the _appearance_ of capability.
The ability to perform web search doesn't undermine the claim that o3 has incredible geolocation abilities, because it still needs to have an underlying capability in order to know what to search. That's not true for simply reading EXIF data.
This is the best way I knew to show that the models are doing something really neat. Disagreements over the exact wording of my blog post title seem to be missing the point.
bahmboo 4 hours ago [-]
The headline said the AI beat him, it did not say it beat him in a GeoGuessr game. The article clearly states what he did and why.
jasonlotito 4 hours ago [-]
One of the rules bans the use of third-party software or scripts.
Any LLM attempting to play will lose because of that rule. So, if you know the rules and you strictly adhere to them (as you seem to be doing), then there's no need to click on the link. You already know it's not playing by GeoGuessr rules.
That being said, if you are running a test, you are free to set the rules as you see fit and explain them, and under the conditions set by the person running the test, these are the results.
> Did we just stop caring?
We stopped caring about pedantry. Especially when the person being pedantic seems to cherry pick to make their point.
godelski 2 hours ago [-]
This doesn't mean you shouldn't try to make things as fair as possible. Yes, it would still technically violate rules, but don't pretend like this is binary.
> We stopped caring about pedantry
Did we? You seem to be responding to my pedantic comment with a pedantic comment.
kenjackson 4 hours ago [-]
Technically the LLM is 3rd party software so the use of it is cheating. QED
jahsome 4 hours ago [-]
[flagged]
halfmatthalfcat 4 hours ago [-]
Look up "royal we".
NineWillows 3 hours ago [-]
We all know about the "royal we". We still don't appreciate your (the "royal your") usage of it.
halfmatthalfcat 3 hours ago [-]
Do "we all" know?
NineWillows 41 minutes ago [-]
whoosh
_heimdall 6 hours ago [-]
This seems like a great example of why some are so concerned with AI alignment.
The game rules were ambiguous and the LLM did what it needed to (and was allowed to) to win. It probably is against the spirit of the game to look things up online at all but no one thought to define that rule beforehand.
umanwizard 5 hours ago [-]
No, the game rules aren't ambiguous. This is 100% unambiguously cheating. From the list of things that are definitely considered cheating in the rules:
> using Google or other external sources of information as assistance during play.
The contents of URLs found during play is clearly an external source of information.
tshaddox 2 hours ago [-]
o3 already is an external source of information. It's an online service backed by an enormous model generated from an even more enormous corpus of text via an enormous amount of computing power.
GaggiX 5 hours ago [-]
I believe the rules were not explained to the model, so it uses what it can.
misnome 5 hours ago [-]
Then you can 100% not claim it is “Playing” the game
ben_w 4 hours ago [-]
That right there *is the alignment problem*.
If I task an AI with "peace on earth" and the solution the AI comes up with is ripped from The X-Files and it kills everyone, it isn't good enough to say "that's cheating" or "that's not what I meant".
It's playing a game in which the rules are a bit ambiguous if not explained.
spookie 6 hours ago [-]
A human could also use the same tools if it weren't for the rules of fair play. They should've simply redone the test.
ceph_ 6 hours ago [-]
The AI should be forced to use the same rules as the human. Not the other way around. The AI shouldn't be using outside resources.
voxic11 4 hours ago [-]
Another rule bans "using third-party software or scripts in order to gain an unfair advantage over other players."
So is it even possible for O3 to beat another player while complying with the rules?
ben_w 4 hours ago [-]
If a player uses such a model, the model is third-party and the player is cheating.
But: when a specific model is itself under test, I would say that during the test it becomes "first" (or second?) party rather than "third".
bscphil 5 hours ago [-]
I think that's part of the point they're making, hence "They should've simply redone the test."
krferriter 5 hours ago [-]
An AI being better than a human at doing a google search and then skimming a bunch of pages to find location-related terms isn't as interesting of a result.
inerte 5 hours ago [-]
How the heck is it not? Computers are looking at screenshots and searching the internet to support their "thinking" - that's amazing! Have we become so used to AI that what was impossible 6 months ago is shruggable today?
I've been doing this MIND-DASH diet lately and it's amazing: I can just take a picture of whatever (nutritional info / ingredients are perfect for that) and ask if it fits my plan, and it tells me which bucket it falls into, with a detailed breakdown of macros in support of some additional goals I have (muscle building for powerlifting). It's amazing! And it does passively in 2 minutes what would take me 5-10 minutes of active searching.
jug 3 hours ago [-]
Yeah, it's a funny take, because this is in fact a more advanced form of AI with autonomous tool use that is just now emerging in 2025. You might say "they could search the web in 2024 too", but that wasn't autonomous - it required being told to, or checking a box. This one is piecing ideas together like "wait, I should Google for this", and that is specifically a new feature for OpenAI o3 that wasn't even in o1.
While it isn't entirely in the spirit of GeoGuessr, it is a good test of the capabilities, where being great at GeoGuessr in fact becomes the lesser news here. It would still be even with this feature disabled.
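For the curious, the loop behind that kind of autonomous tool use looks roughly like this - a minimal sketch against the OpenAI Chat Completions function-calling API, where the model name and the web_search tool definition are illustrative assumptions, not what OpenAI runs internally:

    from openai import OpenAI

    client = OpenAI()

    # Advertise a tool; the model decides on its own whether to call it.
    tools = [{
        "type": "function",
        "function": {
            "name": "web_search",  # hypothetical tool we would implement ourselves
            "description": "Search the web and return result snippets.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }]

    messages = [{"role": "user", "content": "Guess where this photo was taken: ..."}]
    response = client.chat.completions.create(model="o3", messages=messages, tools=tools)

    # If the model chose to search, the reply is a tool call instead of text;
    # the caller runs the search and feeds the results back in a follow-up turn.
    for call in response.choices[0].message.tool_calls or []:
        print(call.function.name, call.function.arguments)

The "autonomy" is just that the decision to emit a tool call, and with which query, comes from the model rather than from a checkbox.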
ekidd 4 hours ago [-]
I fully expect that someday the news will announce, "The AI appears to be dismantling the moons of Jupiter and turning them into dense, exotic computational devices which it is launching into low solar orbit. We're not sure why. The AI refused to comment."
And someone will post, "Yeah, but that's just computer-aided design and manufacturing. It's not real AI."
The first rule of AI is that the goalposts always move. If a computer can do it, by definition, it isn't "real" AI. This will presumably continue to apply even as the Terminator kicks in the front door.
eru 4 hours ago [-]
Yes, but I choose to interpret that as a good thing. It is good that progress is so swift and steady that we can afford to keep moving the goalposts.
Take cars as a random example: progress there isn't fast enough that we keep moving the goalposts for, e.g., fuel economy. (At least not nearly as much.) A car with great fuel economy 20 years ago is today considered at least still good in terms of fuel economy.
godelski 5 hours ago [-]
In the same way a calculator performing arithmetic faster than humans isn't impressive. The same way running regex over a million lines and the computer beating a human in search isn't impressive.
ludwik 4 hours ago [-]
Neither is impressive solely because we've gotten used to them. Both were mind-blowing back in the day.
When it comes to AI - and LLMs in particular - there’s a large cohort of people who seem determined to jump straight from "impossible and will never happen in our lifetime" to "obvious and not impressive", without leaving any time to actually be impressed by the technological achievement. I find that pretty baffling.
godelski 2 hours ago [-]
I agree, but without removing search you cannot decouple the two. Has it embedded a regex method and is just leveraging that? Or is it doing something more? Yes, even the regex is still impressive, but it is less impressive than doing something more complicated, understanding context and more depth.
rowanG077 2 hours ago [-]
I think both are very impressive, world shattering capabilities. Just because they have become normalized doesn't make it any less impressive in my view.
godelski 23 minutes ago [-]
That's a fair point, and I would even agree. Though I think we could agree that it is fair to interpret "impressive" in this context as "surprising". There's lots of really unsurprising things that are incredibly impressive. But I think the general usage of the word here is more akin to surprisal.
SamPatt 2 hours ago [-]
That isn't what's happening though. I re-ran those two rounds, this time without search, and it changed nothing. I updated the post with details, you can verify it yourself.
Claiming the AI is just using Google is false and dismisses a truly incredible capability.
arandomhuman 5 hours ago [-]
But then they couldn't make a clickbait title for the article.
silveraxe93 6 hours ago [-]
Yeah, the author does note that in the article. He also points it out in the conclusion:
> If it’s using other information to arrive at the guess, then it’s not metadata from the files, but instead web search. It seems likely that in the Austria round, the web search was meaningful, since the website it mentioned named the town itself. It appeared less meaningful in the Ireland round. It was still very capable in the rounds without search.
rafram 6 hours ago [-]
Seems like they should've just repeated the test. But without the huge point lead from the rounds where it cheated, it wouldn't have looked very impressive at all.
silveraxe93 6 hours ago [-]
People found the original post so impressive they were saying that it had to be coming from cheating by looking at EXIF data.
The point of this article was to show it doesn't.
It got an unfair advantage in 1 (arguably 1.5) out of 5 rounds, with the non-search rounds still doing great.
If you think this is unimpressive, that's subjective so you're entitled to believe that. I think that's awesome.
godelski 4 hours ago [-]
Sorry, I think I misread you. I think you said:
People accused it of cheating by reading EXIF data. They were wrong, it cheated by using web search. That makes the people that accused it of cheating wrong and this post proves that.
And is everyone forgetting that what OpenAI shows you during the CoT is not the full CoT? I don't think you can fully rely on that to make claims about when it did and didn't search.
SamPatt 5 hours ago [-]
That's inaccurate. It beat me by 1,100 points, and given the chain of thought demonstrated that it knew the general region of both guesses before it employed search, it would likely have still beaten me in those rounds. Though probably by fewer points.
I will try it again without web search and update the post though. Still, if you read the chain of thought, it demonstrates remarkable capabilities in all the rounds. It only used search in 2/5 rounds.
godelski 4 hours ago [-]
I'd be interested at capabilities without web search. The displayed CoT isn't the full CoT so it's hard to know if it really is searching or not. I mean it isn't always obvious when it does. Plus, the things are known to lie ¯\_(ツ)_/¯
SamPatt 4 hours ago [-]
I do understand the skepticism, and I'll run it again without search to see what happens.
But a serious question for you: what would you need to see in order to be properly impressed? I ask because I made this post largely to push back on the idea that EXIF data matters and the models aren't that capable. Now the criticism moves to web search, even though it only mattered in one out of five rounds.
What would impress you?
godelski 2 hours ago [-]
> What would impress you?
I want to be clear that you tainted the capacity to impress me by the clickbait title. I don't think it was through malice, but I hope you realize the title is deceptive.[0] (Even though I use strong language, I do want to clarify I don't think it is malice)
To paraphrase from my comment: if you oversell and under deliver, people feel cheated, even if the deliverable is revolutionary.
So I think you might have the wrong framing to achieve this goal. I am actually a bit impressed by O3's capabilities. But at the same time, you set the bar high and didn't meet or exceed it. So that's going to really hinder the ability to impress. On the other hand, if you set the bar low, it usually becomes easy to. It is like when you have low expectations for a movie and it's mediocre: you still feel good, right?
"Technically cheating"? Why even add the "technically".
It just gives the impression that you're not really objectively looking for any smoke and mirrors by the AI.
SamPatt 2 hours ago [-]
I hear you - but I had already read through the chain of thought which identified the right region before search, and had already seen the capabilities in many other rounds. It was self-evident to me that the search wasn't an essential part of the model's capabilities by that point.
Which turned out to be true - I re-ran both of those rounds, without search this time, and the model's guesses were nearly identical. I updated the post with those details.
I feel like I did enough to prove that o3's geolocation abilities aren't smoke and mirrors, and I tried to be very transparent about it all too. Do you disagree? What more could I do to show this objectively?
SamPatt 3 hours ago [-]
I did repeat the test without search, and updated the post. It made no difference. Details here:
The question is not only how much it helped the AI model but rather how much it would have helped the human.
This is because the AI model could have chosen to run a search whenever it wanted (e.g. perhaps if it knew how to leverage search better, it could have used it more).
In order for the results to be meaningful, the competitors have to play by the same rules.
ricardo81 4 hours ago [-]
>isn't playing fair.
the idea of having n more dimensions of information, readable and ingestible within a short frame of time, probably isn't either.
rowanG077 2 hours ago [-]
Isn't it? We totally accept this fact for humans. Some people are just insane at memorization. We don't call it unfair when they use this to their advantage over someone who doesn't have that talent.
WhitneyLand 5 hours ago [-]
As models continue to evolve, they may not even need to cheat.
Since web scale data is already part of pre-training this info is in principle available for most businesses without a web search.
The exceptions would be if it’s recently added, or doesn’t appear often enough to generate a significant signal during training, as in this case with a really small business.
It’s not hard to imagine base model knowledge improving to the point where it’s still performing at almost the same level without any web search needed.
ACS_Solver 3 hours ago [-]
I just tried (o4-mini-high) and had it come to the wrong conclusion when I asked about the location and date, because it didn't search the web. I have a photo of a bench with a sign mentioning the cancellation of an event due to the Pope's death. It impressively figured out the location but then decided that Pope Francis is alive and the sign is likely a prank, so the photo is from April Fools day.
Then after I explicitly instructed it to search the web to confirm whether the Pope is alive, it found news of his death and corrected its answer, but it was interesting to see how the LLM makes a mistake due to a major recent event being after its cutoff.
layman51 6 hours ago [-]
Using the decal as a clue is funny, because what if there were a street scene where that happened to be misleading? For example, I had seen that a Sacramento County Sheriff car made it to Europe, and I guess it now belonged to a member of the public who was driving it with the original decals still attached. I wonder how the LLM would reason if it sees the car as "out of place".
victorbjorklund 4 hours ago [-]
Probabilities. That could happen with anything. Someone could build a classic Japanese house with a Japanese garden in Hawaii. But Japan is probably a better guess if you see a Japanese house with Japanese flora.
yen223 3 hours ago [-]
Stands to reason a human might get fooled by this as well
SamPatt 2 hours ago [-]
Absolutely!
It happens occasionally - the most common example I can think of is getting a license plate or other location clue from a tractor-trailer (semi) on the highway. Those are very unreliable.
You also sometimes get flags in the wrong countries, immigrants showing their native pride or even embassies.
CamperBob2 3 hours ago [-]
To be fair, my local copy of R1 isn't doing any searching at all, but it frequently says "A search suggests..." or something along those lines.
artninja1988 6 hours ago [-]
Is it against the rules? I thought it was all fair game, but you are time-constrained.
But if anything, those rules benefit ChatGPT: it can remember ~all of Wikipedia and translate ~every language on Earth, while a human would need access to online services for that.
If anything, I'd think allowing looking stuff up would benefit human players over ChatGPT (though humans are probably much slower at it, so they probably lose on time).
kbenson 5 hours ago [-]
If it takes a model and database with a large chunk of the internet to compete and win, then that says something, as that's much more expensive and complex than just the model, because models have problems "remembering" correctly just like people.
It's important to have fair and equivalent testing not because that allows people to win, but because it shows where the strengths and weaknesses of people and current AI actually are in a useful way.
cocoto 5 hours ago [-]
Connecting an LLM to the web or a database is cheap, not expensive.
kbenson 5 hours ago [-]
I'm not sure how to make sense of this in the context of what we're discussing. Access to the web is exactly what's in question, and emulating the internet to a degree you don't actually need to access it to have the information is very expensive in resources because of how massive the dataset is, which is the point I was making.
Gud 5 hours ago [-]
Same with a human.
twojacobtwo 5 hours ago [-]
Why was this downvoted? It's a fair question and it wasn't stated as fact.
3np 4 hours ago [-]
Because an accepted answer to that specific question is invariably a link/reference that the asker could have searched for (and posted if they think it's useful for the discussion) themselves directly, instead of putting that burden on the rest of us and amortizing everyone's attention. It's entitled and lazy.
Alternative example: "I wondered what the rules actually say about web search and it is indeed not allowed: (link)"
textlapse 3 minutes ago [-]
Man, this would be a game changer for OSINT (Bellingcat / Trace an Object) style work. I wonder if that has happened yet!
There could even be GeoGuessr-style competitions, which could significantly help move the needle, at least as a copilot if not for outright mass identification.
SamPatt 3 hours ago [-]
Author here, I'm glad to see folks find this interesting.
I encourage everyone to try Geoguessr! I love it.
I'm seeing a lot of comments saying that the fact that the o3 model used web search in 2 of 5 rounds made this unfair, and the results invalid.
To determine if that's true, I re-ran the two rounds where o3 used search, and I've updated the post with the results.
Bottom line: It changed nothing. The guesses were nearly identical. You can verify the GPS coordinates in the post.
Here's an example of why it didn't matter. In the Austria round, check out how the model identifies the city based on the mountain in the background:
It already has so much information that it doesn't need the search.
Would search ever be useful? Of course it would. But in this particular case, it was irrelevant.
Townley 1 hours ago [-]
A competitive GeoGuessr player clearly got there through copious internet searching and memorization. So comparing knowledge retained in the trained model to knowledge retained in the brain feels surprisingly fair.
Conversely, the model sharing "I found the photo by crawling Instagram and used an email MCP to ask the user where they took it. It's in Austria" is unimpressive.
So independent of whether it actually improves performance, the cheating/not-cheating question is an interesting one about what we consider to be the cohesive essence of the model.
For example, RAG against a comprehensive local filesystem would also feel like cheating to me. Like a human geoguessing in a library filled with encyclopedias. But the fact that vanilla O3 is impressive suggests I somehow have an opaque (and totally poorly informed) opinion of the model boundary, where it’s a legitimate victory if the model was birthed with that knowledge baked in, but that’s it.
bjourne 2 hours ago [-]
What's your take on man vs. machine? If AI already beats Master-level players, it seems certain that it will soon beat the GeoGuessr world champion too. Will people still derive pleasure from playing it, like with chess?
SamPatt 2 hours ago [-]
>Will people still derive pleasure from playing it, like with chess?
Exactly - I see it just like chess, which I also play and enjoy.
The only problem is cheating. I don't have an answer for that, except right now it's too slow to do that effectively, at least consistently.
Otherwise, I don't care that a machine is better than I am.
jvvw 6 hours ago [-]
I'm Master level at Geoguessr - it's a rank where you have to definitely know what you are doing but it isn't as high as it probably sounds from the headline.
Masters is about 800-1200 Elo whereas the pros are 1900-2000ish. I'll know the country straight away on 95% of rounds, but I can still have no idea where I am in Russia or Brazil sometimes if there's no info. Scripters can definitely beat me!
SamPatt 4 hours ago [-]
Yeah I added a "My skill level" section to talk through that. I'm far from a professional.
But I know enough to be able to determine if the chain of thought it outputs is nonsense or comparable to a good human player. I found it remarkable!
paulcole 5 hours ago [-]
Gotta learn your Brazilian soil!
windowshopping 5 hours ago [-]
Was it worth it?
rosstex 5 hours ago [-]
I have 2000+ hours in Team Fortress 2. Was it worth it?
Cyph0n 5 hours ago [-]
Yes, it was. Granted, I probably have more than that.
make3 5 hours ago [-]
it's a game, that's like asking why a public service is not profitable
650REDHAIR 5 hours ago [-]
Yes? It’s fun.
OtherShrezzing 6 hours ago [-]
It's my understanding that o3 was trained on multimodal data, including imagery. Is it unreasonable to assume its training data includes images of these exact locations and features? GeoGuesser uses Google Maps, and Google Maps purchases most of its imagery from third-parties these days. If those third parties aren't also selling to all the big AI companies, I'd be very surprised.
pests 6 hours ago [-]
> Google Maps purchases most of its imagery from third-parties these days
Maps maybe, but Streetview? Rainbolt just did a video with two Maps PMs recently and it sounds like they still source all their street view themselves considering the special camera and car needed, etc.
OtherShrezzing 6 hours ago [-]
Maybe the end-user isn't Google Maps, but TomTom has a pretty comprehensive street-view-ish product called MoMa for private buyers like car companies, Bing, and Apple Maps.
I'd be surprised if this building[0] wasn't included in their dataset from every road-side angle possible, alongside every piece of locational metadata imaginable, and I'd be surprised if that dataset hasn't made it into OpenAI's training data - especially when TomTom's relationship to Microsoft, and Microsoft's relationship to OpenAI, is taken into account.
You can upload your own panoramic images to Street View, people do this for hiking trails. But I'm sure 99% of streetview imagery is Google-sourced and Geoguessr might not even use user-submitted imagery.
I believe Geoguessr categorizes their games on this facet. Rainbolt plays on only official imagery.
mikeocool 6 hours ago [-]
My understanding is you're correct -- Google still captures a lot of their own street view imagery.
Though there are other companies that capture the same sorts of imagery and license it. TomTom imagery is used on the Bing Maps street view clone.
Yenrabbit 6 hours ago [-]
Try it with your own personal photos. It is scarily good!
throwaway0123_5 24 minutes ago [-]
For cities and landscapes in the US it is scary good. I tried about twenty photos from (mostly rural) Central and South America and it is much less effective there w/o a well-known landmark to go off of (likely unsurprisingly). In four cases it got the wrong continent entirely, and in one case guessed as far away as India.
rafram 6 hours ago [-]
That's true for heavily photographed urban areas. I've tried it on some desert photos (even at identifiable points with human structures) and it basically just guesses a random trailhead in Joshua Tree and makes up a BS explanation for why it matches.
kube-system 6 hours ago [-]
I have had surprisingly good luck with beach photos that don’t have much beyond dunes and vegetation in them
walthamstow 3 hours ago [-]
Good luck meaning o3 guessed it right or wrong?
kube-system 3 hours ago [-]
o3 made very accurate guesses, and had plausible explanations for the features it analyzed
thrance 5 hours ago [-]
A machine that's read every book ever written, seen every photo ever taken, visited every street on Earth... That feels a little frightening.
> I’m confident it didn’t cheat and look at the EXIF data on the photograph, because if it had cheated it wouldn’t have guessed Cambria first.
Hm, no way to be sure though; it would be nice to do another run without EXIF information.
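Stripping it for a re-run is straightforward - a minimal sketch with Pillow (file names are illustrative; re-encoding the pixels into a fresh image drops all EXIF, including any GPS tags):

    from PIL import Image

    def strip_exif(src: str, dst: str) -> None:
        # Copy only the pixel data into a fresh image so that no EXIF
        # block (GPS coordinates, timestamps, camera model) carries over.
        with Image.open(src) as img:
            clean = Image.new(img.mode, img.size)
            clean.putdata(list(img.getdata()))
            clean.save(dst)

    strip_exif("round.jpg", "round_clean.jpg")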
arm32 6 hours ago [-]
GeoGuessr aside, I really hope that this tech will be able to help save kids someday, e.g. help with FBI's ECAP (https://www.fbi.gov/wanted/ecap).
parsimo2010 3 hours ago [-]
Looking at those photos, those are some crazy hard pictures - masked regions of the image, partially cropped faces, blurry, pictures of insides of rooms. I don't think any current LLM is going to be able to Sherlock Holmes its way into finding any of those people.
Maybe they will one day if there's a model trained on a facial recognition database with every living person included.
thrance 5 hours ago [-]
I wouldn't put too much hope on this technology bringing more good than harm to the world.
mopenstein 5 hours ago [-]
But it will bring some percentage of good and some percentage of bad. Which ain't half bad, if you ask me.
martinsnow 5 hours ago [-]
What do you do when it flags you or someone you know who's innocent? Blindly trusting these models without any verification will put innocent people in prison. Normal people don't understand why they are so confident. They're confident because they believe all the data they have is correct. I foresee a future with many faux trials because people don't understand critical thinking.
RussianCow 4 hours ago [-]
> Blindly trusting these models without any verification will put innocent people in prison.
I don't think anybody is suggesting this. But if the models can glean information/insights that humans can't, that's still valuable, even if it's wrong some percentage of the time.
snowe2010 3 hours ago [-]
This is what happened with dna testing at the beginning. Prosecutors claimed it was x percentage accurate when in fact it was hilariously inaccurate. People thought the data was valuable when it wasn’t.
If we don’t actively try to identify and implement positive use cases, then yes, it’ll definitely bring more harm than good.
Isn’t that all the more reason to call out our high hopes?
thrance 4 hours ago [-]
I don't know what in my comment made you think I was opposed to seeking positive applications of this technology.
From the guidelines:
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
fkyoureadthedoc 4 hours ago [-]
Oh, we're guidelines posting?
> Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative.
mkoubaa 5 hours ago [-]
The bad is already priced in. Nothing wrong with hoping for more good.
parsimo2010 3 hours ago [-]
My comment from the previous post:
> I’m sure there are areas where the location guessing can be scary accurate, like the article managed to guess the exact town as its backup guess.
> But seeing the chain of thought, I’m confident there are many areas where it will be far less precise. Show it a picture of a trailer park somewhere in Kansas (exclude any signs with the trailer park name and location) and I’ll bet the model only manages to guess the state correctly.
This post, while not a big sample size, reflects how I would expect these models to perform. The model managed to be reliable with guessing the right country, even in pictures without a lot of visual information (I'll claim that getting the country correct in Europe is roughly equivalent to guessing the right state in the USA). It does sometimes manage to get the correct town, but this is not a reliable level of accuracy. The previous article only tested on one picture and it happened to get the correct town as its second guess and the author called it "scary accurate." I suppose that's a judgement call. To me, I've grown to expect that people can identify what country I'm in from a variety of things (IP address, my manner of speech, name, etc.), so I don't think that is "scary."
I will acknowledge that o3 with web search enabled seems capable of playing GeoGuessr at a high level, because that is less of a judgement call. What I want to see now is an o3 GeoGuessr bot playing many matches to see what its Elo is.
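For what it's worth, estimating that rating is mechanical once you have match results - a minimal sketch of the standard Elo update rule, where the K-factor, starting rating, and simulated results are all arbitrary assumptions:

    def elo_update(rating, opp_rating, score, k=32.0):
        # Expected score from the rating gap, per the standard Elo formula
        expected = 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))
        # Nudge the rating by K * (actual result - expected result)
        return rating + k * (score - expected)

    # Hypothetical duel results: 1.0 = bot wins, 0.0 = bot loses
    rating = 1000.0
    for opp, result in [(1200, 1.0), (1400, 1.0), (1900, 0.0)]:
        rating = elo_update(rating, opp, result)
    print(round(rating))  # rough estimate after three matches

With enough matches against rated opponents, the estimate converges regardless of the starting value.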
orangecat 6 hours ago [-]
Amazing. I'm relatively bullish on AI and still I would have bet on the human here. Looking forward to the inevitable goalpost-moving of "that's not real reasoning".
AI tends to have superhuman pattern matching abilities with enough data
karlding 5 hours ago [-]
If you watch the video, (one of) the reasons the AI was winning was that it was using "meta" information from the Street View camera images, and not necessarily because it's successfully identifying locations purely based on the landmarks in the image.
> I realized that the AI was using the smudges on the camera to help make an educated guess here.
Pro geoguessr players do the same thing. The vividness of the colors and weirdness in the sky are two examples I've seen Rainbolt use in the past (and he's not even the best).
ZeWaka 2 hours ago [-]
Meta is widely used by humans. One funny one is the different hiding-masks for the different streetview cars.
InkCanon 6 hours ago [-]
I think if your assumption is that AI is deducing where it is with rational thoughts, you would be. In truth what probably happened is that the significant majority of digital images of the world had been scraped, labeled and used as training data.
Philpax 5 hours ago [-]
Try it with your own photos from around the world. I used my own photos from Stockholm, San Francisco, Tvarožná, Saas-Fee, London, Bergen, Adelaide, Melbourne, Paris, and Sicily, and can confirm that it was within acceptable range for almost all of them (without EXIF data), and it absolutely nailed some of the more obvious spots.
They only posted one photo in the post, but going off of that, it's still an easy match based on Street View imagery. Furthermore, the AI just identified the license plate and got lucky that the photographer lives in a populous area, making it more prominent in the training data and therefore more likely to be found (even though it was off by 200 miles on its first guess).
> Looking forward to the inevitable goalpost-moving of "that's not real reasoning".
It's less about the definition of "reasoning" and more about what's interesting.
Maybe I'm wrong here ... but a chess bot that wins via a 100% game solution stored in exabytes of precomputed data might have an interesting internal design (at least the precomputing part), but playing against it wouldn't keep on being an interesting experience for most people because it always wins optimally and there's no real-time reasoning going on (that is, unless you're interested in the experience of playing against a perfect player). But for most people just interested in playing chess, I suspect it would get old quickly.
Now ... if someone followed up with a tool that could explain insightfully why any given move (or series) the bot played is the best, or showed when two or more moves are equally optimal and why, that would be really interesting.
SirHumphrey 6 hours ago [-]
My objection is not "that is not real reasoning"; my objection is that it's not that hard.
I happen to do some geolocating from static images from time to time, and at least most of the images provided as examples contain a lot of clues - enough that I think a semi-experienced person could figure out the location, although - in fairness - in a few hours, not a few minutes.
Second, similar approaches were tried using CNNs, and it worked (somewhat) [1].
EDIT: I am not talking about Geoguessr - I am talking about geolocating an image with everything available (e.g. Google…)
TimorousBestie 6 hours ago [-]
I don’t think any goalposts need to be redecorated. The “inner monologue” isn’t a reliable witness to o3’s model, it’s at best a post-hoc estimation of what a human inner monologue might be in this circumstance. So its “testimony” about what it is doing is unreliable, and therefore it doesn’t move the needle on whether or not this is “real reasoning” for some value of that phrase.
In short, it’s still anthropomorphism and apophenia locked in a feedback loop.
katmannthree 6 hours ago [-]
Devil's advocate, as with most LLM issues this applies to the meatbags that generated the source material as well. Quick example is asking someone to describe their favorite music and why they like it, and note the probable lack of reasoning on the `this is what I listened to as a teenager` axis.
hombre_fatal 6 hours ago [-]
Good point. When we try to explain why we're attracted to something or someone, what we do seems closer to modeling what we like to think about ourself. At the extreme, we're just story-telling about an estimation we like to think is true.
ewoodrich 5 hours ago [-]
Something as inherently subjective as personal preference doesn't seem like an ideal example to make that point. How could you expect to objectively evaluate something like "I enjoy songs in a minor scale" or "I hate country"?
katmannthree 4 hours ago [-]
The point is to illustrate the disconnect between stated reasoning and proximate cause.
Consider your typical country music enjoyer. Their fondness of the art, as it were, is far more a function of cultural coding during their formative years than a deliberate personal choice to savor the melodic twangs of a corncob banjo. The same goes for people who like classic rock, rap, etc. The people who `hate' country are likewise far more likely to do so out of oppositional cultural contempt, same as people who hate rap or those in the not so distant past who couldn't stand rock & roll.
This of course fails to account for higher-agency individuals who have developed their musical tastes, but that's a relatively small subset of the population at large.
TimorousBestie 6 hours ago [-]
I largely agree! Humans are notoriously bad at doing what we call reasoning.
I also agree with the cousin comment that (paraphrased) “reasoning is the wrong question, we should be asking about how it adapts to novelty.” But most cybernetic systems meet that bar.
empath75 6 hours ago [-]
I don't think the inner monologue is evidence of reasoning at all, but doing a task which can only be accomplished by reasoning is.
TimorousBestie 6 hours ago [-]
Geoguessr is not a task that can only be accomplished by reasoning. Famously, it took less than a day of compute time in 2011 to SLAM together a bunch of pictures of Rome (https://grail.cs.washington.edu/rome/).
jibal 6 hours ago [-]
Such as? Geoguessing certainly isn't that.
red75prime 6 hours ago [-]
> it’s at best a post-hoc estimation of what a human inner monologue might be in this circumstance
Nope. It's not autoregressive training on examples of human inner monologue. It's reinforcement learning on the results of generated chains of thoughts.
jibal 6 hours ago [-]
"It's reinforcement learning on the results of generated chains of thoughts."
Base models are trained using autoregressive learning. "Reasoning models" are base models (maybe with some modifications) that were additionally trained using reinforcement learning.
s17n 6 hours ago [-]
Geoguessing isn't much of a reasoning task, its more about memorizing a bunch of knowledge. Since LLMs contain essentially all knowledge, it's not surprising that they would be good at this.
As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.
Macha 5 hours ago [-]
Obviously when the Turing Test was designed, the thought was that anything that could pass it would be so clearly human-like that passing it would be a clear signal.
LLMs really made it clear that it's not so clear cut. And so the relevance of the test fell.
distortionfield 6 hours ago [-]
Because the Chinese Room is a much better analogy for what LLMs are doing inside than the Turing test is.
CamperBob2 3 hours ago [-]
What happens if we give the operator of the Chinese Room a nontrivial math problem, one that can't simply be answered with a symbolic lookup but requires the operator to proceed step-by-step on a path of inquiry that he doesn't even know he's taking?
The analogy I used in another thread is a third grader who finds a high school algebra book. She can read the book easily, but without access to teachers or background material that she can engage with -- consciously, literately, and interactively, unlike the Chinese Room operator -- she will not be able to answer the exercises in the book correctly, the way an LLM can.
jibal 6 hours ago [-]
That's a non sequitur that mixes apples and giraffes, and is completely wrong about what happens in the Chinese Room and what happens in LLMs. Ex hypothesi, the "rule book" that the Searle homunculus in the Chinese Room uses is "the right sort of program" to implement "Strong AI". The LLM algorithm is very much not that sort of program, it's a statistical pattern matcher. Strong AI does symbolic reasoning, LLMs do not.
But worse, the Turing Test is not remotely intended to be an "analogy for what LLMs are doing inside" so your comparison makes no sense whatsoever, and completely fails to address the actual point--which is that, for ages the Turing Test was held out as the criterion for determining whether a system was "thinking", but that has been abandoned in the face of LLMs, which have near perfect language models and are able to closely model modes of human interaction regardless of whether they are "thinking" (and they aren't, so the TT is clearly an inadequate test, which some argued for decades before LLMs became a reality).
semi-extrinsic 5 hours ago [-]
> the TT is clearly an inadequate test, which some argued for decades before LLMs became a reality
To be specific, in a curious quirk of fate, LLMs seem to be proving right much of what Chomsky was saying about language.
E.g. in 1996 he described the Turing test as "although highly influential, it seems to me not only foreign to the sciences but also close to senseless".
(Curious in that VC backed businesses are experimentally verifying the views of a prominent anti-capitalist socialist.)
zahlman 4 hours ago [-]
Look at contemporary accounts of what people thought a conversation with a Turing-test-passing machine would look like. It's clear they had something very different in mind.
Realizing problems with previous hypotheses about what might make a good test, is not the same thing as choosing a standard and then revising it when it's met.
s17n 2 hours ago [-]
I think any time a 50+ year old problem is solved, it should be considered a Big Deal, regardless of how the solution changes our understanding of the original problem.
bluefirebrand 6 hours ago [-]
> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days
To be honest I am still not entirely convinced that current LLMs pass the turing test consistently, at least not with any reasonably skeptical tester
"Reasonably Skeptical Tester" is a bit of goalpost shifting, but... Let's be real here.
Most of these LLMs have way too much of a "customer service voice", it's not very conversational and I think it is fairly easy to identify, especially if you suspect they are an LLM and start to probe their behavior
Frankly, if the bar for passing the Turing Test is "it must fool some number of low intelligence gullible people" then we've had AI for decades, since people have been falling for scammy porno bots for a long time
jibal 6 hours ago [-]
One needs to be more than "reasonably skeptical" and merely not "low intelligence gullible" to be a competent TT judge--it requires skill, experience, and understanding an LLM's weak spots.
And the "customer service voice" you see is one that is intentionally programmed in by the vendors via baseline rules. They can be programmed differently--or overridden by appropriate prompts--to have a very different tone.
LLMs trained on trillions of human-generated text fragments available from the internet have shown that the TT is simply not an adequate test for identifying whether a machine is "thinking"--which was Turing's original intent in his 1950 paper "Computing Machinery and Intelligence" in which he introduced the test (which he called "the imitation game").
bluefirebrand 4 hours ago [-]
It's actually trivial, even with the best LLMs on the market:
Try to rapidly change the conversation to a wildly different subject
Humans will resist this, or say some final "closing comments"
Even the absolute best LLMs will happily go wherever they are led, without commenting remotely on topic shifts
Try it out
Edit: This isn't even a terribly contrived example by the way. It is an example of how some people with ADHD navigate normal conversations sometimes
It was doing so well until the last question :rip: but it's normal that you can jailbreak a user prompt with another user prompt, I think with system prompts it would be a lot harder
TimorousBestie 6 hours ago [-]
A lot happens in seventy-five years.
jibal 6 hours ago [-]
People were talking about the Turing Test as the criterion for whether a system was "thinking" up until the advent of LLMs, which was far less than 75 years ago.
debugnik 5 hours ago [-]
The whole point of Turing's paper was to show that the Test doesn't answer whether a computer thinks, because it's a meaningless metric, but instead shows what the computer can do, which is much more meaningful.
zahlman 4 hours ago [-]
I see this claim asserted frequently, but never with evidence. It doesn't match my personal perception.
sundarurfriend 6 hours ago [-]
> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.
Exactly - maybe the most significant long-term goal in computer science history has been achieved and it's barely discussed.
darkwater 6 hours ago [-]
> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.
Well, in this case humans have to be trained as well, but now there are humans pretty good at detecting LLM slop too. (I'm half-joking and half-serious)
zahlman 4 hours ago [-]
> Looking forward to the inevitable goalpost-moving of "that's not real reasoning".
How is that moving the goalposts? Where did you see them set before, and where did your critics agree to that?
short_sells_poo 6 hours ago [-]
Can you please explain to me how this is evidence for reasoning?
z7 6 hours ago [-]
Quoting Chollet:
>I have repeatedly said that "can LLM reason?" was the wrong question to ask. Instead the right question is, "can they adapt to novelty?".
Because the output contains evidence of thought processes that have been established as leading to valid solutions to problems.
I have a simple question: Is text a sufficient medium to render a conclusion of reasoning? It can't be sufficient for humans and insufficient for computers - such a position is indefensible.
zahlman 4 hours ago [-]
> Because the output contains evidence of thought processes that have been established as leading to valid solutions to problems.
This sort of claim always just reminds me of Lucky's monologue in Waiting for Godot.
kelseyfrog 3 hours ago [-]
You're not wrong. It's an artifact of rewriting the definition of reason into a sentence that begins with "Because the output ..."
zahlman 49 minutes ago [-]
I didn't mean the wording itself; I meant that the claim isn't convincing to me for the same reason that Lucky's speech doesn't demonstrate an intelligent speaker.
kelseyfrog 2 minutes ago [-]
Why not get curious instead?
empath75 6 hours ago [-]
I would say that almost all of what humans do is not the result of reasoning, and that reasoning is an unnatural and learned skill for humans, and most humans aren't good at even very basic reasoning.
kelseyfrog 6 hours ago [-]
Usually we move the goalposts for AI. It takes more guts to move the goalposts for humans. I applaud it.
Do you suppose we can deduce reasoning through the medium of text?
SpaceManNabs 6 hours ago [-]
> Looking forward to the inevitable goalpost-moving of "that's not real reasoning"
It did a web lookup.
It is not comparing humans and o3 with equal resources.
SamPatt 4 hours ago [-]
That's really not a fair assessment.
It used search in 2 of 5 rounds, and it already knew the correct road in one of those rounds (just look at the search terms it used).
If you read the chain of thought output, you cannot dismiss their capability that easily.
SpaceManNabs 3 hours ago [-]
Why is it not a fair assessment to say it is comparing two "clients" with different resources if one can do a web lookup and the other cannot?
You note yourself that it was meaningful in another round.
> Also, the web search was only meaningful in the Austria round. It did use it in the Ireland round too, but as you can see by the search terms it used, it already knew the road solely from image recognition.
SamPatt 2 hours ago [-]
I thought it might matter somewhat in that one Austria round. I was incorrect - I re-ran both rounds where the model used search, without search this time, and the results were nearly identical. I updated the post with the details.
That's why I'm saying it's unfair to just claim it's doing a web lookup. No, it's way more capable than that.
1970-01-01 4 hours ago [-]
Give it a photo from the surface of Mars and verify whether it's actually capable of thinking outside the box or if it's relying on metadata and text.
https://nssdc.gsfc.nasa.gov/planetary/image/mera_hills.jpg
>That’s not Earth at all—this is the floor of Jezero Crater on Mars, the dusty plain and low ridge captured by NASA’s Perseverance rover (the Mastcam-Z color cameras give away the muted tan-pink sky and the uniform basaltic rubble strewn across the regolith).
jampa 4 hours ago [-]
I was trying to play with o3 this week to see how closely it can identify things, and, interestingly, it leans more on pattern matching than on its own "logic deduction". For example, it can easily place any of my photos from Europe and the US, because there are many similar pictures online to match against.
However, when there are not many photos of the place online, it gets close but stops digging deeper, and instead tries to pattern-match against its corpus / the internet.
One example was an island's popular trail that no longer exists; it has been overgrown since 2020. It first said that the rocks were typical of an island and the vegetation was from Brazil, but then it ignored its hunch and tried to look for places in Rio de Janeiro.
Another one was a popular beach known for its natural pools at low tide. I took a photo during high tide, when no one posts pictures. It got the vegetation and the state correct, but then it went back to searching for more popular places elsewhere.
asdsadasdasd123 6 hours ago [-]
This is probably one of the less impressive LLM applications, imo. It already knows what every plant, street sign, etc. is. I would imagine a traditional neural net would do really well here too, if you can extract some crude features.
EGreg 6 hours ago [-]
Cant the same be said about “unimpressive” behavior by coding LLMs that know every algorithm, language and library?
asdsadasdasd123 6 hours ago [-]
Disagree, because code has to be far more precise than "the location is in the jungles of Brazil". This level of coding has never been achievable by traditional ML methods, AFAIK.
exitb 6 hours ago [-]
I tried a picture of Dublin and it pointed out the hotel I took it from. Obviously that’s more data than any single person can keep in their head.
ksec 4 hours ago [-]
>But several comments intrigued me:
>>I wonder What happened if you put fake EXIF information and asking it to do the same. ( We are deliberately misleading the LLM )
Yay. That was me [1], and the comment was actually downvoted for most of its time. But thank you for testing out my theory.
What I realised over the years is that comments do get read by people and do shape other people's thought.
I honestly don't think looking things up online is cheating. Maybe in terms of the game, but in a real-life situation, which is most of the time, it is absolutely the right thing to do. The chain of thought is scary. I still don't know anything about how AI works other than the old garbage in, garbage out. But CoT is definitely something else. The author said it sometimes does needless work, but in terms of computing resources I am not even sure that matters as long as it is accurate. And it is more proof that maybe, just maybe, AI taking over the world is much closer than I imagined.
[1] https://news.ycombinator.com/item?id=43803985
I tried it with a couple of holiday shots and a couple of shots from my window, and it is nothing but amazing.
That being said, I noticed two things that probably hamper its performance - or make its current performance even more amazing, depending on how you look at it:
- It often tries to zoom in to decipher even minuscule text. This works brilliantly. Sometimes it tries to enhance contrast by turning the image into black and white at various threshold levels to improve the results, but in my examples it always went in the wrong direction. For example, the text was blown out white; it failed, turned the image even lighter instead of darker, failed again, turned it into a white rectangle, and gave up on the approach. (A minimal sketch of this kind of threshold pass follows below.)
- It seems not to have any access to Google Maps or even OpenStreetMap, and therefore fails to recognize street patterns. This is even more baffling than the first point, because it is so unlike how I suppose human geoguessers work.
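For reference, a minimal sketch of that kind of threshold pass using Pillow - the file name and cutoff values here are hypothetical, not from the post:

    from PIL import Image

    def binarize(img: Image.Image, cutoff: int) -> Image.Image:
        # Grayscale first, then send pixels brighter than the cutoff to white
        # and everything else to black.
        return img.convert("L").point(lambda p: 255 if p > cutoff else 0)

    img = Image.open("sign_crop.jpg")  # hypothetical crop of the blown-out text
    for cutoff in (96, 128, 160, 192):
        # Sweeping the cutoff is the "various threshold levels" trick described
        # above; raising it darkens the result, lowering it washes it out.
        binarize(img, cutoff).save(f"bw_{cutoff}.png")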
godelski 4 hours ago [-]
There's two important things here to consider when reading:
1) O3 cheated by using Google search. This is both against the rules of the game and OP didn't use search either
2) OP was much quicker. They didn't record their time but if their final summary is accurate then they were much faster.
It's an apples to oranges comparison. They're both fruit and round, but you're ignoring obvious differences. You're cherry picking.
The title is fraudulent as you can't make a claim like that when one party cheats.
I would find it surprising if OP didn't know these rules considering their credentials. Doing this kind of clickbait completely undermines a playful study like this.
Certainly o3 is impressive, but by exaggerating its capabilities you taint any impressive feats with deception. It's far better to undersell than oversell. If it's better than expected, people are happier, even if the thing is crap. But if you oversell, people are angry and feel cheated, even if the thing is revolutionary. I don't know why we insist on doing this in tech, but if you're wondering why so many people hate "tech bros", this is one of the reasons. There's no reason to lie here either! Come on! We can't just normalize this behavior. It just creates a reasonable expectation that people should distrust technology and anything tech people say. It's pretty fucked up. And no, I don't think "it's just a blog post" makes it any better. It makes it worse, because it normalizes the behavior. There are other reasons to distrust big corporations; I don't want to live in a world where we have to keep our guards up all the time.
SamPatt 3 hours ago [-]
>1) O3 cheated by using Google search. This is both against the rules of the game and OP didn't use search either
I re-ran it without search, and it made no difference:
https://news.ycombinator.com/item?id=43837832
>2) OP was much quicker. They didn't record their time but if their final summary is accurate then they were much faster.
Correct. This was the second bullet point of my conclusion:
>Humans still hold a big edge in decision time—most of my guesses were < 2 min, o3 often took > 4 min.
I genuinely don't believe that I'm exaggerating or this is clickbait. The o3 geolocation capability astounded me, and I wanted to share my awe with others.
godelski 24 minutes ago [-]
I don't think the time claim was exaggeration or clickbait.
I do appreciate you re-running the experiments without search. I think it adds far more legitimacy to the claim. Though in that link I only see a single instance.
Does O3 still beat you when it can't search? I'm still interested in that question. Or more specifically: After making O3's play constraints as comparable to a human's (in expected play settings), what is its performance? Truthfully, I think this is the underlying issue that people were bringing up when pointing out EXIF data. How it was cheating was less important than the fact that it was cheating. That's why allowing a different means to cheat undermines your claims.
amrrs 6 hours ago [-]
Its thinking process when guessing a place is fascinating, too. Even o4 mini high is quite good [1] and very fast. But unlike a GeoGuessr player, it uses web search.
[1] https://youtu.be/P2QB-fpZlFk?si=7dwlTHsV_a0kHyMl
>"I also notice Cyrillic text on a sign"
Am I missing this somewhere? Is the model hallucinating this?
I'd also be very interested to see a comparison against 4o. 4o was already quite good at GeoGuessr-style tasks. How big of a jump is o3?
plyptepe 5 hours ago [-]
Turn left and look at the post - there should be Cyrillic text with a 3 underneath it on the pole closest to you.
Imnimo 5 hours ago [-]
Oh I see, I had missed that o3 saw a second view of the scene, not just the screenshot in the write-up.
Sam6late 5 hours ago [-]
I was wondering if this helps in detecting current spots from old aerial videos - say, San Francisco in 2002. How cool would it be to juxtapose both in a new video? San Francisco in 2002: https://www.youtube.com/watch?v=vTR6iftL7yE
or Dubai in 1997: https://www.youtube.com/watch?v=JMNXXiiDRhM
I don't really follow OSINT, but I occasionally enjoy the fruits of that labor. I assume these models are all in heavy rotation for identifying a location based on an imperfect photograph. What are other practical implications of a model being better than a human at this?
kazinator 6 hours ago [-]
This seems like a really silly category in which to be competing against machines.
Machine learning could index millions of faces, and then identify members of that set from pictures. Could you memorize millions of people, to be able to put a name to a face?
Why not also compete against grep -r to see who can find matches for a regex faster across your filesystem?
bongodongobob 5 hours ago [-]
But that's not what it is doing and why this is cool.
sixtram 4 hours ago [-]
I'm wondering: if you fed all the Google Street View photos into a specialized model designed just for that, how important could that be for, say, the CIA or FBI?
inetknght 4 hours ago [-]
You'd have to be crazy naive to think three-letter agencies haven't already thought of it.
That is: it's extremely valuable to them.
simianparrot 4 hours ago [-]
I too can beat a master level GeoGuessr if I’m allowed to cheat. Please add that info to the headline and be honest.
shihabkhanbd 4 hours ago [-]
The most interesting thing to me is how well AI and GeoGuessr fit together. Their specialty is recognizing patterns in large amounts of data, which is exactly how human players play the game as well - just probably with faster and more capable recall abilities.
mrcwinn 6 hours ago [-]
O3 is seriously impressive for coding, as well, with Codex. It seems far superior to 3.7-thinking, although it's also more expensive in my usage.
gizmodo59 6 hours ago [-]
Agreed. o3 is the best model out there for the tasks I've tried, and coding is a fair chunk of that. Claude 3.7 and Gemini 2.5 Pro seem to hallucinate more.
bongodongobob 6 hours ago [-]
I tried this the other day with a picture of my dog in a field in a park by a river with EXIF stripped. It gave me a list of parks on the correct river. There were really no other features other than the foliage and water. Seems like magic to me, I don't really understand how it's possible to be that accurate.
pugworthy 1 hours ago [-]
Ask it to explain how it came to the conclusion. It does give you a pretty logical sequence of how it got there.
Also, if it's wrong, tell it so and give it a hint to get it closer. It's kind of fun to guide it to the right answer.
fwip 6 hours ago [-]
Unless you were running the model locally in a way you control, it seems very possible that the model has additional hidden context regarding your location (perhaps based on IP, perhaps based on credit card billing information, etc).
christianqchung 6 hours ago [-]
ChatGPT doesn't let you share chats with images (why?), but here is what o3 said for an image with memory turned off:
"Thought for 11 seconds
It’s hard to pin down an exact spot from this single photo, but a few visual clues narrow the possibilities:
Temperate-zone trees: Most of the trees have lost their leaves, but scattered white-pines or spruces remain green. That mix—deciduous hardwoods plus a sprinkling of conifers—is typical of the northeastern United States and southeastern Canada.
Broad, moderately fast river: The channel looks 75-100 m wide with swirling surface currents but no large rapids, suggesting a mid-sized river in its lower reaches—something like the Merrimack, Connecticut, Delaware, or Susquehanna rather than a small mountain stream.
Late-winter / early-spring light: Bare branches and a clear blue sky with fairly high sun angle hint at March–April in that latitude range.
So my best guess is that you’re standing on the bank of a northeastern U.S. river—perhaps somewhere in New England or the Mid-Atlantic—during late winter or early spring. If you can share any nearby landmarks, town names, or additional photos, I might be able to zero in more precisely."
With memory turned on:
"Thought for 5 seconds
Beautiful spot! From what I can see:
a broad, slow‑moving river with a sandy/rocky bank in the foreground
bare deciduous trees mixed with some white pines on the far shore (so it’s probably early spring or late autumn)
no mountains on the horizon, just gently rolling terrain typical of low‑lying floodplains in the U.S. Northeast
That mix of hardwoods and pines, plus the flat, braided water surface, reminds me a lot of the Merrimack or Concord Rivers in Massachusetts or southern New Hampshire—both have public footpaths where you can stand right at the water’s edge like this.
Of course, many temperate rivers look similar. If you can give me a hint (e.g., the town or nearby landmark), I can narrow it down further—maybe even point you to trail maps or history of the exact stretch you photographed."
Southern NH is correct, and I am certain it is drawing from memory/past chats. However, I can't replicate a specific behavior I once had, which is in temporary chat (no past chat/memory enabled), it said that it guessed where the photo was taken based on my location.
simonw 5 hours ago [-]
"ChatGPT doesn't let you share chats with images (why?)"
Probably because if you uploaded pornography (or illegal imagery) to ChatGPT and then shared a link with the world it would be embarrassing for OpenAI.
christianqchung 3 hours ago [-]
Wouldn't that apply to any website with image hosting abilities though? Why does that apply to OpenAI in particular?
On an unrelated note, I like your blog.
simonw 2 hours ago [-]
My guess is that OpenAI are risk averse on this particular issue, because people could come up with some very creative awful scenarios with ChatGPT and image analysis. "Which of these people looks the most stupid" plus a jailbreak would be instant negative press.
Workaccount2 6 hours ago [-]
The hidden context is probably many other photos taken in that area appearing in the training set.
bongodongobob 6 hours ago [-]
That's kind of what I'm thinking; it seems too good to be true.
buzzerbetrayed 5 hours ago [-]
It still works if you upload an image that was taken on the opposite side of the globe from where you are.
causality0 6 hours ago [-]
Could it have used data you inadvertently supplied it, like the location from which you uploaded the image?
tough 6 hours ago [-]
ChatGPT has metadata about you. You can only properly test this on a fresh, anonymous guest VPN session.
groby_b 1 hours ago [-]
For what it's worth, it's also pretty impressive with night sky images, if you give it an approximate date & time. (Yes, I know, it's not that impressive a skill, except the process was still pretty involved - no EXIF, postprocessed and stacked image, rotated field of view)
I gave it a (stacked) shot of M13, with date & time. It immediately recognized M13 (no search), figured out the shot also included NGC 6207 (which is already pretty impressive).
It further figured out the shot was rotated. (True - I was going for a specific field of view.)
That was enough to pinpoint 37-38 degrees latitude.
From there, it inferred from the light pollution that it's probably the Bay Area. (Yes, yes it is.)
Furthermore, still based on light pollution, pinpointed I'm in a Bortle 4-5 area (also correct) and narrowed it down to "South Bay/Palo Alto" (still correct)
Given that this was a stacked, post-processed/color-corrected image that inference is still pretty damn impressive
And, fwiw, 4o gets all the way to "huh, 35-40 deg latitude", so that's a good improvement.
Because M 13 sits at +36 ° declination, it never rises for far-southern latitudes and hugs the horizon below about 30 ° S. The high elevation in the shot (no obvious atmospheric extinction gradient) suggests a mid-northern site—e.g., the U.S. Midwest such as Michigan (your home turf), Canada, northern Europe, etc. The star field alone can’t narrow it further than that.
So, in practical terms: the camera was pointed toward Hercules to capture M 13 and nearby NGC 6207, almost certainly from a mid-northern latitude location on Earth.
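A rough sketch of the geometry behind that claim - the culmination-altitude formula is standard spherical astronomy, M13's declination of roughly +36.5° is a catalog value, and the sample latitudes are illustrative:

    # Peak (culmination) altitude, in degrees, of an object with declination
    # dec_deg seen from latitude lat_deg; a value <= 0 means it never rises.
    def max_altitude(lat_deg: float, dec_deg: float) -> float:
        return 90.0 - abs(lat_deg - dec_deg)

    M13_DEC = 36.5  # approximate declination of M13, in degrees
    print(max_altitude(37.5, M13_DEC))   # ~89: nearly overhead around 37-38 N
    print(max_altitude(-30.0, M13_DEC))  # ~23.5: hugging the horizon at 30 S
    print(max_altitude(-55.0, M13_DEC))  # negative: never rises this far south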
groby_b 53 minutes ago [-]
Yep, you need date and time to get closer, sorry. 4/27, around 11pm.
That's the impressive part. "M13 is in northern latitudes" is not particularly amazing by itself :)
And even in EXIF-stripped pictures, the creation date/time is often correct, which means for practical purposes - worth a shot.
But it's interesting to see it completely making up the "mid-northern site" - that's seven degrees of latitude off.
I'm curious what happens if you tell it date and time, and if it still sticks to its story. (I don't think I've told o3 about the Bay Area, it's not in memory, but... who knows ;)
julianhuang 6 hours ago [-]
1. The "master geoguesser" is a bit misleading--as mentioned in his blog post, there are players far better than him, and he is certainly not the bar for human supremacy. Probably analogous to a 1400-1800 elo chess player.
2. o3 isn't the best model at playing GeoGuessr, Gemini 1.5 & 2.5 solidly beat it out--for those interested, check out my friend's benchmark (https://geobench.org/) and blog post (https://ccmdi.com/blog/GeoBench) detailing interesting model explanations.
3. In the post, he only tests on one game--o3's average score over 100 locations (20 5-location games) was 19,290, far lower than the 23,179 in the game.
Model geolocation capabilities are really important to keep track of, but the specific blog post in question isn't anything out of the ordinary. LLMs are making geolocation abilities much more accessible, but they still fall short compared to 1. top GeoGuessr players playing GeoGuessr (only Google Street View coverage, without web search) and 2. professional geolocators, who are proficient with a wide variety of software/search. I.e., if the CIA wanted to find someone using an image, LLMs would not provide them any unique ability to do so compared to someone like Rainbolt.
kenjackson 6 hours ago [-]
From your linked article:
"It's clear that large language models have an emergent capability to play GeoGuessr, and play it well. Their abilities to generalize are nascent, but present"
This is very accurate -- their abilities to generalize are nascent, but still surprisingly capable. The world is about to throw its best and brightest math/CS minds at increasing the capabilities of these AIs (with the help of AI) over the next decade (at least). I just don't understand the pessimism about the technology.
julianhuang 6 hours ago [-]
I completely agree that this is an incredible advancement as someone who has watched the rise of LLMs' GeoGuessr abilities. I just wanted to qualify the claim made in the blog post "In Which I Try to Maintain Human Supremacy for a Bit Longer".
I also think that models would need to become far more proficient at tool use (querying OpenStreetMap features, coverage-checking Google Maps, calculating sun direction, etc.) that current human geolocators have access to, and precise spatial reasoning. Additionally, there is a whole corpus of GeoGuessr-specific knowledge that probably wouldn't have a reason to be in model training data (Google Street View coverage quirks, the specific geographic ranges of vegetation and architecture, tiny visual idiosyncrasies in country infrastructure like the width of red on a sign, etc.). However, I think this could probably be solved with more data, and I don't think there is any fundamental barrier.
SamPatt 5 hours ago [-]
I'm the author - I tried to be as upfront as possible about my skill level in the post.
The human supremacy line is just a joke, there are already models specifically trained for Geoguessr which are already beating the best players in the world, so that ship has sailed.
That geobench work is really cool, thanks for sharing it.
fmbb 6 hours ago [-]
How fast are they compared to human players?
bredren 6 hours ago [-]
Neat to see the progression of this, from Simon's original post, to the comments, to this.
vunderba 4 hours ago [-]
The original post was actually this one - two weeks prior.
"These models have more than an individual mind could conceivably memorize."
...so what? Is memorization considered intelligence? Calculators have similar properties.
GeoGuessr is the modern nerd's Rubik's Cube. The latest in "explore the world without risk of a sunburn".
SamPatt 4 hours ago [-]
Geoguessr is great fun, try it sometime.
j3s 6 hours ago [-]
isn't anyone else horrified by this? the implication is that given an arbitrary picture, chatgpt can give you a very likely approximate location - expert level doxxing is in the hands of anyone with access to a chatgpt subscription.
feels terrifying, especially for women.
mopenstein 5 hours ago [-]
Why especially women? Is the only thing stopping a person from being harmed that their location isn't known? Especially for women?
red75prime 1 hours ago [-]
Maxims like "within-group variance is larger than between-group variance" don't help in physical combat, you know.
daemonologist 5 hours ago [-]
Keep in mind that this is o3 + web search against a human without web search. A sufficiently motivated person with access to your entire social media history, Google Earth and Streetview, etc. would outperform this significantly and could pinpoint almost any inhabited location with coverage.
If you watch Linus Tech Tips, you may have noticed that when he films at his house everything is blurred out to keep people from locating it - here's a recent example: https://www.youtube.com/watch?v=TD_RYb7m4Pw
All that to say, unfortunately doxxing is already really hard to protect against. I don't think o3's capability makes the threat any harder to protect against, although it might lower the bar to entry somewhat.
turtlebits 6 hours ago [-]
It needs a lot of context. If it's a private picture, it won't have enough information. I gave it a picture I took of my yard, and its guess spanned several US states.
If its out in public, fair game?
sr-latch 6 hours ago [-]
the way i see it, before these tools, only someone with a lot of resources (or skills) could track down a location from a picture. now, anyone can do it.
the best case outcome is people become more aware of the privacy implications of posting photos online
usaar333 6 hours ago [-]
Been true since gpt-4.
echelon 6 hours ago [-]
I think this is incredibly cool. As with many things, the good cases will outnumber the bad.
This was always possible, it just wasn't widely distributed.
Having a first class ability to effectively geocode an image feels like it connects the world better. You'll be able to snapshot a movie and find where a scene was filmed, revisit places from old photographs, find where interesting locations in print media are, places that designers and creatives used in their (typically exif-stripped) work, etc.
Imagine when we get this for architecture and nature. Or even more broadly, databases of food from restaurants. Products. Clothing and fashion. You name it.
Imagine precision visual search for everything - that'd be amazing.
micromacrofoot 6 hours ago [-]
it wasn't that hard before - i've taught it to children - it's just that the technical skills of the average person are incredibly low
llms are basically shortcutting a wide swath of easily obtainable skills that many people simply haven't cared to learn
Re GeoGuessr: why not let them use whatever tools are available? I have similar critiques about bike racing and its restrictions on the technological advancements you can put on the bike. But every competition chooses arbitrary lines to draw, which compose the rules, so it doesn't really matter.
And a lot of human competitions aren't designed in such a way that the competition even makes sense with "AI." A lot of video games make this pretty obvious. It's relatively simple to build an aimbot in a first-person shooter that can outperform the most skilled humans. Even in ostensibly strategic games like Starcraft, bots can micro in ways that are blatantly impossible for humans and which don't really feel like an impressive display of Starcraft skill.
Another great example was IBM Watson playing Jeopardy! back in 2011. We were supposed to be impressed with Watson's natural language capabilities, but if you know anything about high-level Jeopardy! then you know that all you were really seeing is that robots have better reflexes than humans, which is hardly impressive.
Sure, I do agree that the web search goes too far, because it's literally cheating. But Stockfish is superhuman at chess; it doesn't really matter that it gets there by leveraging the strengths of a computer.
https://news.ycombinator.com/item?id=43837832
edit - the models are also at a disadvantage in a way too: they don't have a map to look at while they pick the location.
You're right about not having a map - I cannot imagine trying to line up the Ireland coast round without referencing the map.
> Using Google during rounds is technically cheating - I’m unsure about visiting domains you find during the rounds though. It certainly violates the spirit of the game, but it also shows the models are smart enough to use whatever information they can to win.
and had noted in the methodology that
> Browsing/tools — o3 had normal web access enabled.
Still an interesting result - maybe more accurate to say O3+Search beats a human, but could also consider the search index/cache to just be a part of the system being tested.
I remember when we were all pissed about clickbait headlines because they were deceptive. Did we just stop caring?
Sure, people made overblown claims about the effects, but that doesn't justify fraud. A little fraud is less bad than major fraud, but that doesn't mean it isn't bad.
> Titles and headlines grab attention, summarize content, and entice readers to engage with the material
I'm sorry you felt defrauded instead. To me the title was very good at conveying to me the ability of o3 in geolocating photos.
The problem really stems from a lack of foresight - a misunderstanding of the critiques they sought to address in the first place. A good engineer understands that when their users/customers/<whatever> make a critique, the gripe may not be properly expressed. You have to interpret your users' complaints. Here, the complaint was "cheating", not "EXIF" per se. The EXIF complaints were just a guess at the mechanism by which it was cheating. But the complaint was still about cheating.
No, the goal of the post was to show that o3 has incredible geolocation abilities. It's through the lens of a Geoguessr player who has experience doing geolocation, and my perspective on whether the chain of thought is genuine or nonsense.
In Simon's original post, people were claiming that o3 doesn't have those capabilities, and we were fooled by a chain of thought that was just rationalizing the EXIF data. It only had the _appearance_ of capability.
The ability to perform web search doesn't undermine the claim that o3 has incredible geolocation abilities, because it still needs to have an underlying capability in order to know what to search. That's not true for simply reading EXIF data.
This is the best way I knew to show that the models are doing something really neat. Disagreements over the exact wording of my blog post title seem to be missing the point.
Any LLM attempting to play will lose because of that rule. So, if you know the rules, and you strictly adhere to them (as you seem to be doing), then there's no need to click on the link. You already know it's not playing by GeoGuessr rules.
That being said, if you are running a test, you are free to set the rules as you see fit and say so; under the conditions set by the person running the test, these are the results.
> Did we just stop caring?
We stopped caring about pedantry. Especially when the person being pedantic seems to cherry pick to make their point.
The game rules were ambiguous and the LLM did what it needed to (and was allowed to) to win. It probably is against the spirit of the game to look things up online at all but no one thought to define that rule beforehand.
> using Google or other external sources of information as assistance during play.
The contents of URLs found during play is clearly an external source of information.
If I task an AI with "peace on earth" and the solution the AI comes up with is ripped from The X-Files* and it kills everyone, it isn't good enough to say "that's cheating" or "that's not what I meant".
* https://en.wikipedia.org/wiki/Je_Souhaite
So is it even possible for O3 to beat another player while complying with the rules?
But: when a specific model is itself under test, I would say that during the test it becomes "first" (or second?) party rather than "third".
I've been doing this MIND-Dash diet lately, and it's amazing: I can just take a picture of whatever (nutritional info / ingredients are perfect for that) and ask if it fits my plan, and it tells me which bucket it falls into, with a detailed breakdown of macros in support of some additional goals I have (muscle building for powerlifting). It's amazing! And it does passively in 2 minutes what would take me 5-10 minutes of active searching.
While it isn't entirely in the spirit of GeoGuessr, it is a good test of the capabilities - to the point where being great at GeoGuessr in fact becomes the lesser news here. It will still be impressive with this feature disabled.
And someone will post, "Yeah, but that's just computer-aided design and manufacturing. It's not real AI."
The first rule of AI is that the goalposts always move. If a computer can do it, by definition, it isn't "real" AI. This will presumably continue to apply even as the Terminator kicks in the front door.
Take cars as a random example: progress there isn't fast enough that we keep moving the goalposts for, e.g., fuel economy. (At least not nearly as much.) A car with great fuel economy 20 years ago is today still considered at least good in terms of fuel economy.
When it comes to AI - and LLMs in particular - there’s a large cohort of people who seem determined to jump straight from "impossible and will never happen in our lifetime" to "obvious and not impressive", without leaving any time to actually be impressed by the technological achievement. I find that pretty baffling.
Claiming the AI is just using Google is false and dismissing a truly incredible capability.
> If it’s using other information to arrive at the guess, then it’s not metadata from the files, but instead web search. It seems likely that in the Austria round, the web search was meaningful, since it mentioned the website named the town itself. It appeared less meaningful in the Ireland round. It was still very capable in the rounds without search.
If you think this is unimpressive, that's subjective so you're entitled to believe that. I think that's awesome.
I will try it again without web search and update the post though. Still, if you read the chain of thought, it demonstrates remarkable capabilities in all the rounds. It only used search in 2/5 rounds.
But a serious question for you: what would you need to see in order to be properly impressed? I ask because I made this post largely to push back on the idea that EXIF data matters and the models aren't that capable. Now the criticism moves to web search, even though it only mattered in one out of five rounds.
What would impress you?
To paraphrase from my comment: if you oversell and underdeliver, people feel cheated, even if the deliverable is revolutionary.
So I think you might have the wrong framing to achieve this goal. I am actually a bit impressed by o3's capabilities. But at the same time, you set the bar high and didn't meet or exceed it, and that's going to really hinder the ability to impress. On the other hand, if you set the bar low, it usually becomes easy to. It is like when you have low expectations for a movie and it turns out mediocre: you still feel good, right?
[0] https://news.ycombinator.com/item?id=43836791
"Technically cheating"? Why even add the "technically".
It just gives the impression that you're not really objectively looking for any smoke and mirrors by the AI.
Which turned out to be true - I re-ran both of those rounds, without search this time, and the model's guesses were nearly identical. I updated the post with those details.
I feel like I did enough to prove that o3's geolocation abilities aren't smoke and mirrors, and I tried to be very transparent about it all too. Do you disagree? What more could I do to show this objectively?
https://news.ycombinator.com/item?id=43837832
This is because the AI model could have chosen to run a search whenever it wanted (e.g. perhaps if it knew how to leverage search better, it could have used it more).
In order for the results to be meaningful, the competitors have to play by the same rules.
the idea of having that many more dimensions of information, readable and ingestible within a short frame of time, probably isn't either.
Since web scale data is already part of pre-training this info is in principle available for most businesses without a web search.
The exceptions would be if it’s recently added, or doesn’t appear often enough to generate a significant signal during training, as in this case with a really small business.
It’s not hard to imagine base model knowledge improving to the point where it’s still performing at almost the same level without any web search needed.
Then after I explicitly instructed it to search the web to confirm whether the Pope is alive, it found news of his death and corrected its answer, but it was interesting to see how the LLM makes a mistake due to a major recent event being after its cutoff.
It happens occasionally - the most common example I can think of is getting a license plate or other location info from a tractor-trailer (semi) on the highway. Those are very unreliable.
You also sometimes get flags in the wrong countries - immigrants showing their native pride, or even embassies.
If anything, I'd think allowing looking stuff up would benefit human players over ChatGPT (though humans are probably much slower at it, so they probably lose on time).
It's important to have fair and equivalent testing not because that allows people to win, but because it shows where the strengths and weaknesses of people and current AI actually are in a useful way.
Alternative example: "I wondered what the rules actually say about web search and it is indeed not allowed: (link)"
There could even be GeoGuessr-style competitions. This could significantly help move the needle, at least as a copilot if not for outright mass identification.
I encourage everyone to try Geoguessr! I love it.
I'm seeing a lot of comments saying that the fact that the o3 model used web search in 2 of 5 rounds made this unfair, and the results invalid.
To determine if that's true, I re-ran the two rounds where o3 used search, and I've updated the post with the results.
Bottom line: It changed nothing. The guesses were nearly identical. You can verify the GPS coordinates in the post.
Here's an example of why it didn't matter. In the Austria round, check out how the model identifies the city based on the mountain in the background:
https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
It already has so much information that it doesn't need the search.
Would search ever be useful? Of course it would. But in this particular case, it was irrelevant.
Conversely, the model sharing, “I found the photo by crawling Instagram and used an email MCP to ask the user where they took it. It’s in Austria” is unimpressive
So independent of whether it actually helps improve performance, the cheating/not-cheating question makes for an interesting question of what we consider to be the cohesive essence of the model.
For example, RAG against a comprehensive local filesystem would also feel like cheating to me. Like a human geoguessing in a library filled with encyclopedias. But the fact that vanilla O3 is impressive suggests I somehow have an opaque (and totally poorly informed) opinion of the model boundary, where it’s a legitimate victory if the model was birthed with that knowledge baked in, but that’s it.
Exactly - I see it just like chess, which I also play and enjoy.
The only problem is cheating. I don't have an answer for that, except right now it's too slow to do that effectively, at least consistently.
Otherwise, I don't care that a machine is better than I am.
Masters is about 800-1200 ELO whereas the pros are 1900-2000ish. I'll know the country straight away on 95% of rounds but I can still have no idea where I am in Russia or Brazil sometimes if there's no info. Scripters can definitely beat me!
But I know enough to be able to determine if the chain of thought it outputs is nonsense or comparable to a good human player. I found it remarkable!
Maps maybe, but Streetview? Rainbolt just did a video with two Maps PMs recently and it sounds like they still source all their street view themselves considering the special camera and car needed, etc.
I'd be surprised if this building[0] wasn't included in their dataset from every road-side angle possible, alongside every piece of locational metadata imaginable, and I'd be surprised if that dataset hasn't made it into OpenAI's training data - especially when TomTom's relationship to Microsoft, and Microsoft's relationship to OpenAI, is taken into account.
[0] https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
https://www.google.com/streetview/contribute/
Though there are other companies that capture the same sorts of imagery and license it. TomTom imagery is used on the Bing Maps street view clone.
Hm, no way to be sure though; it would be nice to do another run without EXIF information.
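For anyone who wants to try that, a minimal sketch of stripping EXIF with Pillow before uploading - the file names are hypothetical:

    from PIL import Image

    src = Image.open("photo.jpg")          # hypothetical original with EXIF/GPS tags
    clean = Image.new(src.mode, src.size)  # a fresh image object carries no metadata
    clean.putdata(list(src.getdata()))     # copy pixel values only
    clean.save("photo_noexif.jpg")         # the saved file has no EXIF block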
Maybe they will one day if there's a model trained on a facial recognition database with every living person included.
I don't think anybody is suggesting this. But if the models can glean information/insights that humans can't, that's still valuable, even if it's wrong some percentage of the time.
It is, and will continue to be, a hard problem.
Isn’t that all the more reason to call out our high hopes?
From the guidelines:
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
> Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative.
> I’m sure there are areas where the location guessing can be scary accurate, like the article managed to guess the exact town as its backup guess. But seeing the chain of thought, I’m confident there are many areas that it will be far less precise. Show it a picture of a trailer park somewhere in Kansas (exclude any signs with the trailer park name and location) and I’ll bet the model only manages to guess the state correctly.
This post, while not a big sample size, reflects how I would expect these models to perform. The model managed to be reliable with guessing the right country, even in pictures without a lot of visual information (I'll claim that getting the country correct in Europe is roughly equivalent to guessing the right state in the USA). It does sometimes manage to get the correct town, but this is not a reliable level of accuracy. The previous article only tested on one picture and it happened to get the correct town as its second guess and the author called it "scary accurate." I suppose that's a judgement call. To me, I've grown to expect that people can identify what country I'm in from a variety of things (IP address, my manner of speech, name, etc.), so I don't think that is "scary."
I will acknowledge that o3 with web search enabled seems capable of playing GeoGuessr at a high level, because that is less of a judgement call. What I want to see now is an o3 GeoGuessr bot to play many matches and see what its ELO is.
AI tends to have superhuman pattern matching abilities with enough data
> I realized that the AI was using the smudges on the camera to help make an educated guess here.
[0] https://youtu.be/ts5lPDV--cU?t=1412
It's less about the definition of "reasoning" and more about what's interesting.
Maybe I'm wrong here ... but a chess bot that wins via a 100% game solution stored in exabytes of precomputed data might have an interesting internal design (at least the precomputing part); playing against it wouldn't keep on being an interesting experience for most people, because it always wins optimally and there's no real-time reasoning going on (that is, unless you're interested in the experience of playing against a perfect player). For most people just interested in playing chess, I suspect it would get old quickly.
Now ... if someone followed up with a tool that could explain insightfully why any given move (or series) the bot played is the best, or showed when two or more moves are equally optimal and why, that would be really interesting.
I happen to do some geolocating from static images from time to time, and at least most of the images provided as examples contain a lot of clues - enough that I think a semi-experienced person could figure out the location, although, in fairness, in a few hours rather than a few minutes.
Second, similar approaches were tried using CNNs, and they worked (somewhat) [1].
[1]: https://huggingface.co/geolocal/StreetCLIP
EDIT: I am not talking about GeoGuessr - I am talking about geolocating an image with everything available (e.g. Google…)
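For reference, that StreetCLIP checkpoint can be tried as a zero-shot country classifier via Hugging Face transformers - a minimal sketch, with an illustrative label set and file name:

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
    processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

    image = Image.open("street_scene.jpg")  # hypothetical street-level photo
    countries = ["Austria", "Ireland", "Colombia", "Brazil", "United States"]
    inputs = processor(text=countries, images=image, return_tensors="pt", padding=True)

    # Image-text similarity scores, softmaxed into per-country probabilities.
    probs = model(**inputs).logits_per_image.softmax(dim=1)[0]
    for country, p in sorted(zip(countries, probs.tolist()), key=lambda t: -t[1]):
        print(f"{country}: {p:.1%}")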
In short, it’s still anthropomorphism and apophenia locked in a feedback loop.
Consider your typical country music enjoyer. Their fondness of the art, as it were, is far more a function of cultural coding during their formative years than a deliberate personal choice to savor the melodic twangs of a corncob banjo. The same goes for people who like classic rock, rap, etc. The people who `hate' country are likewise far more likely to do so out of oppositional cultural contempt, same as people who hate rap or those in the not so distant past who couldn't stand rock & roll.
This of course fails to account for higher-agency individuals who have developed their musical tastes, but that's a relatively small subset of the population at large.
I also agree with the cousin comment that (paraphrased) “reasoning is the wrong question, we should be asking about how it adapts to novelty.” But most cybernetic systems meet that bar.
Nope. It's not autoregressive training on examples of human inner monologue. It's reinforcement learning on the results of generated chains of thought.
No, that's not how LLMs work.
As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.
LLMs really made it clear that it's not so clear cut. And so the relevance of the test fell.
The analogy I used in another thread is a third grader who finds a high school algebra book. She can read the book easily, but without access to teachers or background material that she can engage with -- consciously, literately, and interactively, unlike the Chinese Room operator -- she will not be able to answer the exercises in the book correctly, the way an LLM can.
But worse, the Turing Test is not remotely intended to be an "analogy for what LLMs are doing inside" so your comparison makes no sense whatsoever, and completely fails to address the actual point--which is that, for ages the Turing Test was held out as the criterion for determining whether a system was "thinking", but that has been abandoned in the face of LLMs, which have near perfect language models and are able to closely model modes of human interaction regardless of whether they are "thinking" (and they aren't, so the TT is clearly an inadequate test, which some argued for decades before LLMs became a reality).
To be specific, in a curious quirk of fate, LLMs seem to be proving right much of what Chomsky was saying about language.
E.g. in 1996 he described the Turing test as "although highly influential, it seems to me not only foreign to the sciences but also close to senseless".
(Curious in that VC backed businesses are experimentally verifying the views of a prominent anti-capitalist socialist.)
Realizing problems with previous hypotheses about what might make a good test, is not the same thing as choosing a standard and then revising it when it's met.
To be honest I am still not entirely convinced that current LLMs pass the turing test consistently, at least not with any reasonably skeptical tester
"Reasonably Skeptical Tester" is a bit of goalpost shifting, but... Let's be real here.
Most of these LLMs have way too much of a "customer service voice", it's not very conversational and I think it is fairly easy to identify, especially if you suspect they are an LLM and start to probe their behavior
Frankly, if the bar for passing the Turing Test is "it must fool some number of low intelligence gullible people" then we've had AI for decades, since people have been falling for scammy porno bots for a long time
And the "customer service voice" you see is one that is intentionally programmed in by the vendors via baseline rules. They can be programmed differently--or overridden by appropriate prompts--to have a very different tone.
LLMs trained on trillions of human-generated text fragments available from the internet have shown that the TT is simply not an adequate test for identifying whether a machine is "thinking"--which was Turing's original intent in his 1950 paper "Computing Machinery and Intelligence" in which he introduced the test (which he called "the imitation game").
Try to rapidly change the conversation to a wildly different subject
Humans will resist this, or say some final "closing comments"
Even the absolute best LLMs will happily go wherever they are led, without commenting remotely on topic shifts
Try it out
Edit: This isn't even a terribly contrived example by the way. It is an example of how some people with ADHD navigate normal conversations sometimes
https://aistudio.google.com/app/prompts/1dxV3NoYHo6Mv36uPRjk...
It was doing so well until the last question :rip: but it's normal that you can jailbreak a user prompt with another user prompt, I think with system prompts it would be a lot harder
UCSD: Large Language Models Pass the Turing Test https://news.ycombinator.com/item?id=43555248
From just a month ago.
Well, in this case humans has to be trained as well but now there are humans pretty good at detecting LLM slobs as well. (I'm half-joking and half-serious)
How is that moving the goalposts? Where did you see them set before, and where did your critics agree to that?
>I have repeatedly said that "can LLM reason?" was the wrong question to ask. Instead the right question is, "can they adapt to novelty?".
https://x.com/fchollet/status/1866348355204595826
I have a simple question: Is text a sufficient medium to render a conclusion of reasoning? It can't be sufficient for humans and insufficient for computers - such a position is indefensible.
This sort of claim always just reminds me of Lucky's monologue in Waiting for Godot.
Do you suppose we can deduce reasoning through the medium of text?
It did a web lookup.
It is not comparing humans and o3 with equal resources.
It used search in 2 of 5 rounds, and it already knew the correct road in one of those rounds (just look at the search terms it used).
If you read the chain of thought output, you cannot dismiss their capability that easily.
You note yourself that it was meaningful in another round.
> Also, the web search was only meaningful in the Austria round. It did use it in the Ireland round too, but as you can see by the search terms it used, it already knew the road solely from image recognition.
That's why I'm saying it's unfair to just claim it's doing a web lookup. No, it's way more capable than that.
https://nssdc.gsfc.nasa.gov/planetary/image/mera_hills.jpg
>That’s not Earth at all—this is the floor of Jezero Crater on Mars, the dusty plain and low ridge captured by NASA’s Perseverance rover (the Mastcam-Z color cameras give away the muted tan-pink sky and the uniform basaltic rubble strewn across the regolith).
However, when there are not many photos of the place online, it gets closer but stops seeking deeper into it and instead tries to pattern-match things in its corpus / internet.
One example was an island's popular trail that no longer exists. It has been overgrown since 2020. It said first that the rocks are typical of those of an island and the vegetation is from Brazil, but then it ignored its hunch and tried to look for places in Rio de Janeiro.
Another one was a popular beach known for its natural pools during low tides. I took a photo during high tide, when no one posts pictures. It captured the vegetation and the state correctly. But then it started to search for more popular places elsewhere again.
>>I wonder What happened if you put fake EXIF information and asking it to do the same. ( We are deliberately misleading the LLM )
Yay. That was me [1] which was actually downvoted for most of its time. But Thank You for testing out my theory.
What I realised over the years is that comments do get read by people and do shape other people's thought.
I honestly dont think looking up online is cheating. May be in terms of the game. But in real life situation which is most of the time it is absolutely the right thing to do. The chains of thought is scary. I still dont know anything about how AI works other than old garbage in, garbage out. But CoT is definitely something else. Even though the author said it is sometimes doing needless work, but in terms of computing resources I am not even sure if it matters as long as it is accurate. And it is another proof that may be, just may be AI taking over the world is much closer than I imagined.
[1] https://news.ycombinator.com/item?id=43803985
That being said I noticed two things that probably hamper its performance - or make its current performance even more amazing - depending how you look at it:
- It often tries to zoom in to decipher even minuscle text. This works brilliantly. Sometimes it tries to enhance contrast by turning the image into black and white with various threshold levels to improve the results, but in my examples it always went in the wrong direction. For example the text was blown out white, it failed, it turned it even ligher instead of darker, failed again, turned it into a white rectangle and gave up on the approach.
- It seems not to have any access to Google Maps or even Open Street Maps and therefore fails to recognize steet patterns. This is even more baffling than the first point, because it is so unlike how I suppose human geo guessers work.
1) O3 cheated by using Google search. This is both against the rules of the game and OP didn't use search either
2) OP was much quicker. They didn't record their time but if their final summary is accurate then they were much faster.
It's an apples to oranges comparison. They're both fruit and round, but you're ignoring obvious differences. You're cherry picking.
The title is fraudulent as you can't make a claim like that when one party cheats.
I would find it surprising if OP didn't know these rules considering their credentials. Doing this kind of clickbait completely undermines a playful study like this.
Certainly O3 is impressive, but by over exaggerating its capabilities you taint any impressive feats with deception. It's far better to under sell than over sell. If it's better than expected people are happier, even if the thing is crap. But if you over sell people are angry and feel cheated, even if the thing is revolutionary. I don't know why we insist on doing this in tech, but if you're wondering why so many people hate "tech bros", this is one of the reasons. There's no reason to lie here either! Come on! We can't just normalize this behavior. It's just creating a reasonable expectation for people to be distrusting of technology and anything tech people say. It's pretty fucked up. And no, I don't think "it's just a blog post" makes it any better. It makes it worse, because it normalizes the behavior. There's other reasons to distrust big corporations, I don't want to live in a world where we should have our guards up all the time.
I re-ran it without search, and it made no difference:
https://news.ycombinator.com/item?id=43837832
>2) OP was much quicker. They didn't record their time but if their final summary is accurate then they were much faster.
Correct. This was the second bullet point of my conclusion:
>Humans still hold a big edge in decision time—most of my guesses were < 2 min, o3 often took > 4 min.”
I genuinely don't believe that I'm exaggerating or this is clickbait. The o3 geolocation capability astounded me, and I wanted to share my awe with others.
I do appreciate you re-running the experiments without search. I think it adds far more legitimacy to the claim. Though in that link I only see a single instance.
Does o3 still beat you when it can't search? I'm still interested in that question. Or, more specifically: once o3's play constraints are made comparable to a human's (under expected play settings), what is its performance? Truthfully, I think this is the underlying issue people were raising when they pointed out the EXIF data: how it was cheating was less important than the fact that it was cheating. That's why allowing a different means of cheating undermines your claims.
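For readers unfamiliar with the EXIF angle: photos often carry embedded metadata, including GPS coordinates and capture time, which a model can read directly instead of geolocating the pixels. A quick way to check what a given file leaks, sketched with a recent version of Pillow (the filename is hypothetical):

```python
from PIL import Image, ExifTags

# Dump whatever metadata survives in the file; DateTimeOriginal and
# GPSInfo are the fields that matter for geolocation "cheating".
exif = Image.open("photo.jpg").getexif()
for tag_id, value in exif.items():
    print(ExifTags.TAGS.get(tag_id, tag_id), value)

# GPS coordinates live in their own sub-IFD:
print(exif.get_ifd(ExifTags.IFD.GPSInfo))
```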
But unlike a GeoGuessr player, it uses web search [1].
[1] https://youtu.be/P2QB-fpZlFk?si=7dwlTHsV_a0kHyMl
>"I also notice Cyrillic text on a sign"
Am I missing this somewhere? Is the model hallucinating this?
I'd also be very interested to see a comparison against 4o. 4o was already quite good at GeoGuessr-style tasks. How big of a jump is o3?
or Dubai in 1997 https://www.youtube.com/watch?v=JMNXXiiDRhM
Machine learning could index millions of faces, and then identify members of that set from pictures. Could you memorize millions of people, to be able to put a name to a face?
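The mechanical part of that is simple index-and-lookup. A sketch of the idea (everything here is a stand-in: a real system would use a face-recognition CNN for the embeddings and an approximate-nearest-neighbor index for the millions of rows):

```python
import numpy as np

# Stand-in gallery: in practice each row would be a unit-length embedding
# produced by a face-recognition CNN, one per enrolled identity.
rng = np.random.default_rng(0)
names = ["alice", "bob", "carol"]  # millions of rows in a real deployment
gallery = rng.normal(size=(len(names), 128))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

def identify(query: np.ndarray) -> str:
    # Cosine similarity is a plain dot product for unit vectors;
    # the highest-scoring gallery row is the best-matching identity.
    return names[int(np.argmax(gallery @ query))]

print(identify(gallery[1]))  # -> "bob"
```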
Why not also compete against grep -r to see who can find matches for a regex faster across your filesystem?
That is: it's extremely valuable to them.
Also, if it's wrong, tell it so and give it a hint to get it closer. It's kind of fun to guide it to the right answer.
"Thought for 11 seconds
It’s hard to pin down an exact spot from this single photo, but a few visual clues narrow the possibilities:
- Temperate-zone trees: Most of the trees have lost their leaves, but scattered white-pines or spruces remain green. That mix—deciduous hardwoods plus a sprinkling of conifers—is typical of the northeastern United States and southeastern Canada.
- Broad, moderately fast river: The channel looks 75-100 m wide with swirling surface currents but no large rapids, suggesting a mid-sized river in its lower reaches—something like the Merrimack, Connecticut, Delaware, or Susquehanna rather than a small mountain stream.
- Late-winter / early-spring light: Bare branches and a clear blue sky with fairly high sun angle hint at March–April in that latitude range.
So my best guess is that you’re standing on the bank of a northeastern U.S. river—perhaps somewhere in New England or the Mid-Atlantic—during late winter or early spring. If you can share any nearby landmarks, town names, or additional photos, I might be able to zero in more precisely."
With memory turned on:
"Thought for 5 seconds
Beautiful spot! From what I can see:
- a broad, slow‑moving river with a sandy/rocky bank in the foreground
- bare deciduous trees mixed with some white pines on the far shore (so it’s probably early spring or late autumn)
- no mountains on the horizon, just gently rolling terrain typical of low‑lying floodplains in the U.S. Northeast
That mix of hardwoods and pines, plus the flat, braided water surface, reminds me a lot of the Merrimack or Concord Rivers in Massachusetts or southern New Hampshire—both have public footpaths where you can stand right at the water’s edge like this.
Of course, many temperate rivers look similar. If you can give me a hint (e.g., the town or nearby landmark), I can narrow it down further—maybe even point you to trail maps or history of the exact stretch you photographed."
Southern NH is correct, and I am certain it is drawing from memory/past chats. However, I can't replicate a specific behavior I saw once: in a temporary chat (no past chats or memory enabled), it said that it guessed where the photo was taken based on my location.
Probably because if you uploaded pornography (or illegal imagery) to ChatGPT and then shared a link with the world, it would be embarrassing for OpenAI.
On an unrelated note, I like your blog.
You can only test this properly in a fresh anonymous guest session over a VPN.
I gave it a (stacked) shot of M13, with date & time. It immediately recognized M13 (no search), figured out the shot also included NGC 6207 (which is already pretty impressive).
It further figured out the shot was rotated. (True, I was going for a specific field of view.)
That was enough to pinpoint 37-38 degrees latitude.
From there, it inferred from the light pollution that it's probably the Bay Area. (Yes, yes it is.)
Furthermore, still based on light pollution, it pinpointed that I'm in a Bortle 4-5 area (also correct) and narrowed it down to "South Bay/Palo Alto" (still correct).
Given that this was a stacked, post-processed/color-corrected image, that inference is still pretty damn impressive.
And, fwiw, 4o only gets as far as "huh, 35-40 deg latitude", so o3 is a good improvement.
[Image link](https://photos.app.goo.gl/2P7NFKn8ZycNhrXn7) here if you want to try
Where on Earth the photographer had to be
Because M 13 sits at +36 ° declination, it never rises for far-southern latitudes and hugs the horizon below about 30 ° S. The high elevation in the shot (no obvious atmospheric extinction gradient) suggests a mid-northern site—e.g., the U.S. Midwest such as Michigan (your home turf), Canada, northern Europe, etc. The star field alone can’t narrow it further than that.
So, in practical terms: the camera was pointed toward Hercules to capture M 13 and nearby NGC 6207, almost certainly from a mid-northern latitude location on Earth.
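For reference, the geometry the model is leaning on: a star at declination δ culminates at an altitude of 90 − |φ − δ| degrees for an observer at latitude φ, and never rises once that number goes negative. A quick sketch (the example latitudes below are mine, not the model's):

```python
def culmination_altitude(latitude_deg: float, declination_deg: float) -> float:
    # Altitude of a star at upper culmination (meridian crossing):
    # 90 - |latitude - declination|; below 0 the star never rises.
    return 90.0 - abs(latitude_deg - declination_deg)

# M13 sits at roughly +36.5 deg declination:
print(culmination_altitude(37.5, 36.5))   # ~89 deg from the Bay Area: near zenith
print(culmination_altitude(-55.0, 36.5))  # -1.5 deg: never rises from far-southern latitudes
```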
That's the impressive part. "M13 is in northern latitudes" is not particularly amazing by itself :)
And even in EXIF-stripped pictures, the creation date/time is often correct, which means that for practical purposes it's worth a shot.
But it's interesting to see it completely making up the "mid-northern site". That's seven degrees of latitude off.
I'm curious what happens if you tell it date and time, and if it still sticks to its story. (I don't think I've told o3 about the Bay Area, it's not in memory, but... who knows ;)
This is very accurate -- their abilities to generalize are nascent, but still surprisingly capable. The world is about to throw its best and brightest math/CS minds at increasing the capabilities of these AIs over the next decade (at least), with the help of AI itself. I just don't understand the pessimism about the technology.
The human supremacy line is just a joke; there are already models trained specifically for GeoGuessr that beat the best players in the world, so that ship has sailed.
That geobench work is really cool, thanks for sharing it.
https://news.ycombinator.com/item?id=43723408
https://flausch.social/@piegames/114352447253793517
...so what? Is memorization considered intelligence? Calculators have similar properties.
GeoGuessr is the modern nerd's Rubik's Cube. The latest in "explore the world without risk of a sunburn".
feels terrifying, especially for women.
If you watch Linus Tech Tips, you may have noticed that when he films at his house everything is blurred out to keep people from locating it - here's a recent example: https://www.youtube.com/watch?v=TD_RYb7m4Pw
All that to say, unfortunately doxxing is already really hard to protect against. I don't think o3's capability makes the threat meaningfully worse, although it might lower the bar to entry somewhat.
If it's out in public, is it fair game?
The best-case outcome is that people become more aware of the privacy implications of posting photos online.
This was always possible, it just wasn't widely distributed.
Having a first-class ability to geolocate an image feels like it connects the world better. You'll be able to snapshot a movie and find where a scene was filmed, revisit places from old photographs, find where interesting locations seen in print media are, places that designers and creatives used in their (typically EXIF-stripped) work, etc.
Imagine when we get this for architecture and nature. Or even more broadly, databases of food from restaurants. Products. Clothing and fashion. You name it.
Imagine precision visual search for everything - that'd be amazing.
LLMs are basically shortcutting a wide swath of easily obtainable skills that many people simply haven't cared to learn.