AI Responses May Include Mistakes

The other day I wanted to look up a specific IBM PS/2 model, a circa 1992 PS/2 Server system. So I punched the model into Google, and got this:

That did not look quite right, since the machine I was looking for had 486 processors (yes, plural). And it most certainly did use Microchannel (MCA).

Alright, let’s try again:

Simply re-running the identical query produces a different summary, although the AI still claims that the PS/2 Model 280 is an ISA-based 286 system. Maybe the third time is the charm?

The AI is really quite certain that the PS/2 Model 280 was a 286-based system released in 1987, while I was actually looking for a newer machine. Interestingly, the first time around the AI claimed the Model 280 had 1 MB RAM expandable to 6 MB, and now it supposedly only has 640 KB RAM. But the AI seems sure that the Model 280 had a 1.44 MB drive and VGA graphics.

What if we try again? After a couple of attempts, yet another answer pops up:

Oh look, now the PS/2 Model 280 is a 286 expandable to 128 MB RAM. Amazing! Never mind that the 286 was architecturally limited to 16 MB.
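
For context, the 16 MB ceiling follows directly from the 286’s 24-bit physical address bus; the arithmetic is simple:

    # The 80286 has 24 physical address lines, so the most it can address is:
    print(2 ** 24)              # 16777216 bytes
    print((2 ** 24) // 2**20)   # 16 MB, nowhere near 128 MB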

Even better, the AI now tells us that “PS/2 Model 280 was a significant step forward in IBM’s personal computer line, and it helped to establish the PS/2 as a popular and reliable platform.”

The only problem with all that? There is no PS/2 Model 280, and never was. I simply had the model number wrong. The Google AI just “helpfully” hallucinates something that at first glance seems quite plausible, but is in fact utter nonsense.

But wait, that’s not the end of the story. If you try repeating the query often enough, you might get this:

That answer is actually correct! “Model 280 was not a specific model in the PS/2 series”, and there was in fact an error in the query.

Here’s another example of a correct answer:

Unfortunately the correct answer comes up maybe 10% of the time when repeating the query, if at all. In the vast majority of attempts, the AI simply makes stuff up. I do not consider made-up, hallucinated answers useful; in fact, they are worse than useless.

This minor misadventure might provide a good window into AI-powered Internet search. To a non-expert, the made-up answers will seem highly convincing, because there is a lot of detail and overall the answer does not look like junk.

An expert will immediately notice discrepancies in the hallucinated answers, and will for example consult the List of IBM PS/2 Models article on Wikipedia, which very quickly establishes that there is no Model 280.

The (non-expert) users who would most benefit from an AI search summary will be the ones most likely to be misled by it.

How much would you value a research assistant who gives you a different answer every time you ask, and although sometimes the answer may be correct, the incorrect answers look, if anything, more “real” than the correct ones?

When Google says “AI responses may include mistakes”, do not take it lightly. The AI generated summary could be utter nonsense, and just because it sounds convincing doesn’t mean it has anything to do with reality. Caveat emptor!


26 Responses to AI Responses May Include Mistakes

  1. Stu says:

    I’ve often noticed that LLMs are really bad at “admitting” that they don’t “know” something. They’ll pretty much always give a plausible-looking (at least at first glance) answer to more-or-less any question that’s not obviously nonsense…

    Of course, that also means they’ll hallucinate plausible answers to questions based on incorrect premises.

    I can only imagine the implications for “vibe coding”…

  2. MiaM says:

    Re “vibe coding”: I wonder how the quality of AI-generated code depends on the skills of the person using the AI? Like, do the comparisons and whatnot take into account that whoever is most likely to write the queries is probably far from an expert in the field?

    Speaking of incorrect results: It would be great if the search engines kept track of changes in major things people search for. When commenting, I was about to use the autofill feature in Firefox and got annoyed that it had incorrectly saved some random junk as an alternative to my name. I googled it, and the first result was totally incorrect. Limiting the search to results at most one year old gave the correct suggestion (type the first letter of the incorrect suggestion, hold shift, select the incorrect suggestion, keep holding shift and press delete. Poof, the incorrect suggestion is gone!)

  3. zeurkous says:

    This is why me avoids the term “AI” — it’s more of an AS, really —
    in favour of “ML” (machine learning).

    In fact, me’s always understood that ML was one of the things to come
    out of the grand *failure* of the {6,7}0s AI projects. (The problem?
    after extensive research into replicating intelligence, they found out
    they couldn’t really define “intelligence” at all! Oops.)

    Either way, the main diff between contemporary “AI” and the 80s stuff is
    that the contemporary version runs with a *much* bigger database
    (anything it can scrape off the interwebs); it still operates on the
    basis of statistical inference (if me has me terminology right). What
    could *possibly* go wrong…?

    The question WTH one would even want an “AI” modeled on humans me’ll
    leave as an exercise for fellow commenta^W^Wthe reader.

  4. SweetLow says:

    OTOH, modern models still have not learned one simple rule of logic (one of the cornerstones of the scientific method): “No matter how many negative statements you had before – after one verified positive statement all those negative statements are false”. So from my point of view they are just giant semantic networks, not intellect.

  5. Michal Necasek says:

    The way it was explained to me is that LLMs have absolutely zero concept of “I’m certain about this” or “I am very uncertain about this”. It is really just a statistical machine generating the most likely sequence of tokens. The LLM does not know what it knows or doesn’t know, it just plows ahead and produces something. That arguably completely undermines any utility LLMs might have in the area of research, because if what you get could be (and often is) complete garbage, why even bother?

    Numerous times I have observed that LLM translation would rather produce a nonsensical string of digits than just say “I don’t know what this means”.
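
    As a minimal illustration of that point (toy numbers, not a real model), the core mechanism is just weighted sampling over a vocabulary, and nothing in it can express “I don’t know”:

        import random

        # Toy next-token step: whatever the question, the sampler simply picks from
        # a probability distribution over tokens. The probabilities below are made
        # up purely for illustration; a real model has an enormous vocabulary.
        next_token_probs = {"286": 0.4, "386": 0.3, "486": 0.2, "Pentium": 0.1}
        tokens, weights = zip(*next_token_probs.items())
        print(random.choices(tokens, weights=weights)[0])  # always emits something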

  6. Michal Necasek says:

    I agree that artificial it may be, intelligent it’s certainly not. But Google calls it AI so… that’s their problem if they want to give AI a bad name.

    AI is definitely something that’s been around for 50+ years. Browsing old magazines, I noticed how 386-based PCs were touted as “great for AI” back in 1986-87.

  7. zeurkous says:

    Transport Tycoon gave “AI” a bad name way before Google was around =)

    Bad terminology, like all bad habits, is quite pervasive and stubborn.

  8. MiaM says:

    @zeurkous:
    Haha, games that generate infrastructure seem to always struggle to some extent.

    The spiritual successor to Transport Tycoon would be Transport Fever 2 and the way the “AI” expands cities is sometimes really weird.

    Also Workers & Resources: Soviet Republic has a mode where it optionally generates pre-built “old” cities and roads when starting a new game (at least on a random map, can’t remember if it can do this on custom maps?) and in some cases the infrastructure clearly shows that the “AI” struggled to build things. Bonus fun fact: If you immediately pause the game when starting a new game with pre-built cities and roads, you can sometimes see the animation for removing a road where the “AI” at first decided to build a road and then regretted its choice :O

  9. Victor Khimenko says:

    The whole most recent AI craze started with something invented at Google for one and ONE task only: machine translation. That’s it, nothing less, nothing more.
    It turned out that it can translate between many different things, not just between human languages, but also between “description of algorithm in English” and “description of algorithm in Python”, and even “translate” from the “name of a PS/2 system” into “expanded specification of said PS/2 system”.
    But when you ask it to translate the name of a model that doesn’t exist… well… it does a VERY credible imitation of a student in a similar situation during an exam: it “tries to read the answer in the eyes of the examiner”.
    Whether to call it AI or not is a good question, but there is no thinking or deduction involved, just memory.

  10. Richard Wells says:

    It doesn’t matter what it is called. Is it useful? So far, the answer is no. Sure, some developer could carefully craft a demo that seems to work for one problem, but it will be a long time before it can be relied on.

  11. Michal Necasek says:

    I think LLMs may turn out to be a 90% solution. Working alright 90% of the time, and suitable for awesome demos… but impossible to make work reliably 99%, let alone 100%, of the time.

  12. Victor Khimenko says:

    Web search also doesn’t work 100% of the time… but we have learned to live with that. The biggest problem of AI is not that it sometimes fails (it couldn’t be reliable simply because of how it’s constructed), but the fact that people try to push it into areas where reliability is expected, where 100, sometimes 1000 or more steps are needed. If you have a reliability of 99% (and that’s more than current AI models deliver!) and do 1000 steps… the chances of getting a correct result are more or less indistinguishable from zero.
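
    To put rough numbers on that (a simplification, assuming independent steps and a flat 99% per-step success rate):

        # Probability that all N independent steps succeed at per-step reliability p.
        p, n = 0.99, 1000
        print(p ** n)  # ~4.3e-05, i.e. roughly one error-free run in 23,000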

  13. MiaM says:

    Web search rarely produces incorrect results, rather no results at all or results that are obviously irrelevant. When it does produce incorrect results, we quickly learn that if the search result is from Quora it might not be correct.

  14. Bob Nelson says:

    AI – Artificial Idiot

  15. Josh Rodd says:

    Asking Grok questions about PS/2s produces hilariously wrong results, such as asking it why the Image Adapter/A isn’t compatible with an IBM PS/2 P70, where the answer is “Because both of those have proprietary interfaces”. (The actual reason is that the 16-bit slot has the DRAM in the way of a longer card, and the 32-bit slot can’t accommodate an AVE card.)

    Here’s another example: “Why are OS/2 1.x physical device drivers required to be bimodal, that is, real and protected mode capable, in their interrupt routine?”

    Answer: “Interrupt handling can occur in either mode, depending on the context (e.g., a DOS application running in the Virtual DOS Machine or a protected-mode OS/2 process). Device drivers, especially their interrupt routines, must handle interrupts in both modes to ensure compatibility and functionality.”

    Well, no, that answer is 100% incorrect; the context would be a DOS application running in the very much not virtual but rather quite real DOS compatibility box.

    “To avoid duplicating code or requiring separate drivers for each mode, OS/2 1.x mandates that physical device drivers’ interrupt routines be bimodal. This means the routine must detect the current CPU mode (e.g., by checking the processor’s status flags or segment registers) and execute the appropriate code path for real or protected mode.”

    That’s sort of correct…

    “The driver must also handle stack differences (real mode uses a 16-bit stack, protected mode a 32-bit stack) and ensure proper register preservation across modes.”

    Now that’s just laughably incorrect. There’s no bimodal code that can handle both a 16-bit and a “32-bit stack” (whatever that is). And OS/2 1.x PDDs aren’t 32-bit at all.

    “This design accommodates the hybrid nature of OS/2 1.x’s architecture and its need to interact with both legacy and modern hardware environments.”

    Chortlingly incorrect.

  16. Michal Necasek says:

    Good stuff. It’s clear that the AI has trouble differentiating between OS/2 1.x and 2.x, which is something humans find trivial.

  17. Jonathan Wilson says:

    The fact that AIs are great at producing output that looks correct at first glance but is actually totally wrong is why I won’t use these newfangled generative AIs. If an AI can produce output like that for a search query (output that seems correct but is actually totally wrong), then how can I trust AI to produce correct output when asked to generate code or whatever else?

  18. zeurkous says:

    The worst of both worlds.

  19. MiaM says:

    As a side track, I think that the idea with using AI to generate code is to have it generate tedious things that are similar across multiple programs. But I think that is the wrong solution – the actual problem is having dev environments and whatnot that require you to repeatedly do tedious things. I’m sad++ that “Visual Basic” minus Basic never was a thing. Like, why didn’t anyone do a “Visual C++” that was actually visual in how you create your code, like in Visual Basic, and not just a GUI editor/debugger?

    Nowadays it would be great to have a development environment where you just “paint a UI”, either as a web page or a smartphone app, and just add code to each UI element, and of course some additional things like startup code, code that runs at intervals or when some external event happens, and so on.

  20. Michael Kelly says:

    No different from a human being who makes up an answer based on what they know rather than admit they don’t know the answer. That same human being, if pushed hard enough, will eventually admit they don’t have an answer to the question. Maybe human beings should have a sign on their forehead that says “My responses may include mistakes”.

    Regardless of where or how a person gets their information, it is caveat emptor for the person receiving that information.

  21. Michal Necasek says:

    How many people like that do you know? People who will very authoritatively give you a detailed and completely made-up answer?

  22. Richard Wells says:

    The classic example is the know-it-all barfly who will regale anyone with dubious factual information as long as prompted by the offer of free beer. Of course, no one made future business decisions based on the recommendations of Cliff Clavin.

    When AI works well, I regard it as equivalent to having a team of well-meaning junior high students building a report. They know enough to make a fairly accurate representation of the sources accessed but don’t know which sources are incorrect.

    @MiaM: The Visual Basic coding model worked well with small dashboard database utilities but was very inefficient for larger projects. Indeed, most of the HyperCard-style development tools have faded out, with Dynamics possibly the last major product, if Dynamics (the accounting package) is still written with the Dynamics development tool.

  23. MiaM says:

    @Richard:

    Yeah, I get that the model is probably not great for large projects that aren’t as user-input driven as the classic “try out VB” app, a calculator, is.

    But it would still have been great if it had been way easier to “paint a GUI” and use that from within C. Like a feature that creates a skeleton program and later uses a few comments to find the switch/case thing that reacts to UI input, and can automatically insert new things there with a //todo: write code comment, and also add comments or even comment out parts that aren’t used if a GUI element is deleted in the “GUI painter”.

    Going off on a tangent, I’ve always found asynchronous I/O APIs (as in doing file I/O asynchronously, rather than talking to RS232 ports) cumbersome in many OSes. Like it seems like you were supposed to run file I/O in separate threads, which of course is doable but feels like the wrong thing to do in a message-driven application.

    I direct this critique at more or less all of the OSes: 16-bit Windows, AmigaOS, Unix with X11, classic Macintosh, and whatnot.

  24. Vlad Gnatov says:

    > But it would still have been great if it had been way easier to “paint a GUI” and use that from within C.
    There are many visual GUI constructors, they are just not very popular. I personally use Visual Tcl*. It’s good at putting together a usable GUI in minutes and is cross-platform. That is, of course, if you can tolerate that dated Motif look 🙂

    >Like it seems like you were supposed to run file I/O in separate threads, which of course is doable but feels like the wrong thing to do in a message-driven application.
    That’s because running async events in a separate thread is much easier than properly integrating them into the state machine. You can check the nginx sources as an example of the latter.

    *) https://www.derekfountain.org/articles/vtcl.pdf

  25. George says:

    This is just more proof that this whole phenomenon has been misnamed. I’d like to put forth these alternatives:
    – AEM (Artificial Eidetic Memory) or
    – TBAG (The Big-ass Guesser)

  26. AlistairH says:

    @MiaM: Maybe Delphi or Lazarus for Pascal or C++ Builder for, erm… C++, is what you are describing?
