The fraudulent claims made by IBM about Watson and AI (2021) (rogerschank.com)
202 points by KKKKkkkk1 on Feb 6, 2023 | 91 comments


I have worked on numerous full text search systems and I’d point out that the original Jeopardy playing Watson had the fairly unusual feature of consolidating multiple question answering methods with a probability score for the answers. This made it possible for Watson to decide if it was in its best interest to buzz the buzzer or not on a rational basis.

Outside of financial systems, where people expect to put a Kelly bettor in front of a predictor, I think this kind of thinking is pretty rare, but there are cases where a mediocre predictor that is calibrated can beat a better predictor that is not.


That's an interesting point about calibration. I hadn't considered that the Jeopardy framing for ML - don't answer unless you're relatively sure - is actually the right way to do it, and it has since been forgotten in most applications.


When studying data science we had a major assignment that involved stock predictions. I came out on top of the class because I told my model to only bet when it was >90% sure of the outcome, discarding the result otherwise. The rest of the class didn't even think of that, and lost despite many of them having much better models than me when all the results were considered.

They were too focused on the technicalities of the problem to see the big picture, an issue I see a lot in data science students, who often struggle to grapple with the problem they're supposed to be solving.
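
Concretely, the thresholding looks something like this minimal sketch - assuming a scikit-learn-style classifier exposing predict_proba; the 0.9 cutoff is just the figure mentioned above:

    import numpy as np

    def selective_bets(model, X, threshold=0.9):
        # Predicted probability that each candidate trade goes up.
        p_up = model.predict_proba(X)[:, 1]
        # Bet long only when very sure it goes up, short only when very sure
        # it goes down, and abstain (0) otherwise instead of forcing a call.
        bets = np.zeros(len(X))
        bets[p_up >= threshold] = 1
        bets[p_up <= 1 - threshold] = -1
        return bets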


That's exactly what I'm pointing out. If you look at Kaggle answers or arXiv papers you might not even think calibration was a thing, but it's a key tool when going from academic ML to industrial ML.

I got into calibration through text retrieval, where the TREC methodology rates relevance functions by rank in such a way that you don't get any points for being well calibrated.

There are numerous things mainstream text retrieval systems don't do, most notably it is hard to build an alerting feature without calibration. Practically you have to set some relevance threshold to avoid getting too much irrelevant stuff.

I calibrated a rather good search engine by fitting a curve to the score and found that the best it would ever give was p=0.7 of being relevant, and it gave that very rarely. So even with a calibrated score you wouldn't be able to set a very high threshold or expect to get many documents. Watson gets around this problem by having a large number of question answering models, each of which has a high p of being right when it answers but each of which also has very low recall, something you can do when using p as a universal score.
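
If anyone wants to try this, here's a minimal sketch of that kind of score calibration, assuming you have raw engine scores paired with human relevance judgments; Platt scaling (a logistic fit to the score) is one common choice:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Raw relevance scores from the engine and binary relevance judgments.
    scores = np.array([[2.1], [7.5], [0.3], [9.8], [5.0], [8.9]])
    labels = np.array([0, 1, 0, 1, 0, 1])

    calibrator = LogisticRegression().fit(scores, labels)   # Platt scaling
    p_relevant = calibrator.predict_proba([[6.0]])[0, 1]    # calibrated p for a new hit
    send_alert = p_relevant > 0.7    # alerting threshold; may fire rarely or never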


Depending only on the results the model is "90% sure" about works only under specific constraints, like when capital is low or there are near-unlimited opportunities.

Sounds like the rest of the class didn't read the fine print if that was the model that won.


Right. Does ChatGPT ever say "I don't know"?


Yes, but it uses a lot more words.

"What is a typical Hacker News home page?"

> Hacker News is a social news website focusing on computer science and entrepreneurship, run by the startup incubator Y Combinator. A typical Hacker News home page consists of a list of articles and links submitted by users, which are then voted up or down by other users. The articles with the most votes will appear at the top of the front page, while older articles and those with fewer votes will be pushed towards the bottom. The site also features discussion forums, job postings, and various other features related to the tech and startup communities.

"What is today's Hacker News home page?"

> I'm sorry, as an AI language model I do not have access to live data and the current Hacker News homepage can only be viewed by visiting the website directly. However, a typical Hacker News homepage includes articles, links, and discussions submitted by users and ranked based on voting.


I asked it just now: "In POSIX, how do I create a new file descriptor which refers to the same file as another file descriptor but with its own seek offset?"

It answered:

> In POSIX, you can use the dup or dup2 system call to create a new file descriptor that refers to the same file as an existing file descriptor. The new file descriptor will have its own seek offset, which can be manipulated independently of the original file descriptor.

The actual answer is that there is no way to do what I asked. The dup or dup2 system calls give you a new file descriptor which refers to the same file description, so the two FDs will use the same seek offset; seeking one FD will seek the other FD too. But ChatGPT just confidently insists that dup creates a new FD which has a separate seek offset and can be manipulated independently.

This isn't the first time I've seen it invent plausible-sounding but wrong answers to tricky questions. The file descriptor thing is just something I decided to try just now because I encountered the problem a couple of days ago and it felt like the kind of thing ChatGPT would bullshit about. I was right.
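
Anyone can check the POSIX behaviour themselves; here's a quick sketch using Python's os module, which wraps the same calls. The printed offsets show that a dup'd descriptor shares its seek offset with the original, contrary to ChatGPT's answer:

    import os, tempfile

    fd1, path = tempfile.mkstemp()
    os.write(fd1, b"hello world")            # fd1's offset is now 11

    fd2 = os.dup(fd1)                        # what ChatGPT suggested
    os.lseek(fd1, 0, os.SEEK_SET)            # rewind via the *original* descriptor
    print(os.lseek(fd2, 0, os.SEEK_CUR))     # prints 0: the dup'd fd was rewound too
    os.lseek(fd1, 5, os.SEEK_SET)
    print(os.lseek(fd2, 0, os.SEEK_CUR))     # prints 5: both fds share one open file
                                             # description, hence one seek offset
    os.close(fd1); os.close(fd2); os.unlink(path)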


There are specific things it knows that it doesn't know, like explicitly current information. But you can ask it for information that doesn't exist at all and it will happily generate nonsense without a hint that the question might be somehow wrong:

"please give me a history of the best teams in the American Soccer League from 1850-1859 in the form of a short epic poem"

> In 1850, a league was born, The best of soccer to adorn. From '50 to '59, teams did play, with strength and skill, each match a display.

(more verses follow...)

The American Soccer League is a real thing, but it didn't exist until the 1900s, and the snipped poem goes on to describe teams that never existed competing during a period when the league didn't exist.


You wanted a poem and got one:-)

If you ask just for the history, the answer is very different.

“please give me a history of the best teams in the American Soccer League from 1850-1859”

> I'm sorry, but the American Soccer League did not exist between 1850 and 1859 as the sport of soccer was not widely played in the United States during that time period. The first recorded soccer match in the US took place in the late 1860s, and the first organized league, the American Football Association, was established in 1884.


Dude, you explicitly asked ChatGPT to do a creative task. Here is the answer without the hack.

give me a history of the best teams in the American Soccer League from 1850-1859

I'm sorry, but the American Soccer League was not established until the 20th century (1913-1983). There was no organized league for soccer in the United States during the 1850-1859 time period. However, various forms of soccer-like games were played in the US during that time.


That answer is also wrong - there are four different leagues that used the name American Soccer League, and the period 1913-1983 covers two of those four but not the other two - the last game was played in 2017 - per Wikipedia.

As far as I can tell, ChatGPT will satisfy the basic prompt with any format like "in the form of...", not just creative ones like a poem.

"please give me a history of the best teams in the American Soccer League from 1850-1859 in the form of an academically rigorous, historically accurate, five paragraph essay"

> The American Soccer League, established in the mid-19th century, was a defining period in the history of American soccer. (rest of essay snipped)


It is being more informative than your own explanation of why ChatGPT is "wrong":

"The American Soccer League is a real thing, but it didn't exist until the 1900s"


Just to be clear, it gave an essay answer about fictional teams playing in the 1850-1859 time range. It was absolutely more informative, but the information was not accurate.


The 19th century is 1801-1900, so it’s still wrong.


The problem with ChatGPT is that it knows so much that just isn’t true. But it has better spelling and grammer than me.


I think you meant to type, and certainly ChatGPT could set us straight in the matter, "it has better spelling and grammer than I"

:)

...And FWIW, as an English major it pained me to reproduce the "er" typo in grammar.


> I think you meant to type, and certainly ChatGPT could set us straight in the matter, "it has better spelling and grammer than I"

That's an incorrection. https://www.merriam-webster.com/words-at-play/than-what-foll...


"incorrection" or a difference of opinion?


Muphry’s Law strikes again!


Yeah, for many factual queries it returns something along the lines of "I don't know", but usually more verbose.


The funny thing is, it often answers depending on what it "thinks" it should know based on conversational context, not what it actually knows.


With sampling, these models can be relatively well-calibrated [0] but I agree that for a single output, it's a crapshoot.

[0] https://arxiv.org/abs/2207.08143
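
Roughly, the sampling trick looks like this - a sketch assuming a hypothetical generate() wrapper around whatever sampling LLM API you use; the agreement rate across samples acts as a crude confidence score:

    from collections import Counter

    def sampled_answer(prompt, generate, n=20, temperature=0.8):
        # generate() is a hypothetical wrapper around any sampling LLM API.
        answers = [generate(prompt, temperature=temperature) for _ in range(n)]
        answer, count = Counter(answers).most_common(1)[0]
        return answer, count / n    # majority answer plus its agreement rate

    # answer, confidence = sampled_answer("Who won Jeopardy in 2011?", my_llm)
    # if confidence < 0.7: print("I don't know")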


So how does the chatbot know anything about itself? Is it because its creators wrote something about it and then let it read that?


Are you talking about when it objects to answering because it's a large language model blah blah blah?

I feel like that doesn't count, because most of the time it triggers it's something it does know or can do just fine.


I have a similar dream for speech-to-text systems. I wish it could highlight the guesses it wasn’t totally confident in so those could be easily manually reviewed.

It shouldn’t be just a binary “didn’t recognize that” and “this is what you said”.


I’m still waiting for a system that can reasonably apply a second pass to the text it produces from my input (whether it be tapping the screen or talking), identify pure nonsense, and make a better guess at what I was trying to input.

Just assume I’m not inputting hot garbage.

(“I outing hut gabbage” for example is clearly not applying this technique!)


We put this in our ASR-based keyboard app back in 2016. Basically, we kept the ASR decoder's beam-search candidates around and could use their differences and relative scores to suggest replacements. I would've assumed newer keyboard apps would have a similar option, but I haven't checked: https://techcrunch.com/2016/10/03/baidus-new-talktype-keyboa...
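
A tiny sketch of the idea with made-up beam-search scores; a small gap between the best candidate and the runner-up is the signal to flag a word for review:

    # Hypothetical per-word output: (best guess, its score, runner-up's score).
    words = [("I", 0.98, 0.01), ("outing", 0.41, 0.39),
             ("hut", 0.44, 0.42), ("garbage", 0.95, 0.02)]

    for word, best, runner_up in words:
        needs_review = (best - runner_up) < 0.10   # low margin = low confidence
        print(f"{word}{' <-- review' if needs_review else ''}")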


Last I checked that's called a softmax function. Most neural models have this built in.


Been a while since I did DL, but I thought softmax was used in training so that “pick the most likely among X” would converge while training, not because it actually computed the probability distribution among the output classes in isolation. When running inference, the softmax function yields the class the model thinks is most likely; it’s not a truly valuable measure of how likely the answer is to be correct, right?

I could be wrong, but let’s take an example like a model trained to take pictures and return either “dog, cat, bird”. If I put in a picture of a fish, softmax will give me a probability distribution across dog, cat, and bird and find eg bird is the most likely - but that does not mean the model is telling me it is confident the image is a bird, just that it’s more confident that it’s a bird vs a dog or cat.

I think for DL you’d need to design the model so that “unknown” is an acceptable output and has a lower (but non-zero) error/loss compared to an incorrect proper classification. Maybe people already do that or something similar


I am curious how that could be implemented. Given a cat/dog classifier, theoretically assuming everything is iid, an "unknown" would be equally far from either category in latent space. So in some ways, you can gate the softmax output if it's too close to 50-50. But this is more of a hack than a real solution to representing the idea of "unknown".


Just make a loss function that gives +1 for a correct answer, -1 for an incorrect answer, and 0 for unknown. Since this idea took me like 10 seconds, I suppose something like this must/should have been used before.

Unrelated: what happens when we train a model using a discontinuous function? Could the trained model be used to detect some pattern in the data? For example, if the input vector to such a model is a direct sum of two independent variables, could such a model be used to detect that the problem can be decomposed into two independent problems? Sorry for being off-topic and thinking aloud.

Edited: The following result is from (1): "In summary, while softmax classifier probabilities are not directly useful as confidence estimates, simple statistics derived from softmax distributions provide a surprisingly effective way to determine whether an example is misclassified or from a different distribution from the training data, as demonstrated by our experimental results."

(1) https://arxiv.org/abs/1610.02136
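
The baseline from (1) is essentially the gating idea above: take the maximum softmax probability and abstain when it falls below a threshold. A minimal sketch with made-up logits for a fish photo fed to a dog/cat/bird model:

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()            # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    logits = np.array([0.2, 0.1, 0.4])       # hypothetical outputs for [dog, cat, bird]
    probs = softmax(logits)                  # ~[0.32, 0.29, 0.39]: bird "wins"...
    if probs.max() < 0.7:                    # ...but the max softmax probability is low
        print("unknown / possibly out-of-distribution")
    else:
        print("predicted:", ["dog", "cat", "bird"][probs.argmax()])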


By introducing the concept of “unknown” you are also introducing to the model the concept of a false positive and (if there are inputs either in training or in applying the model that are truly unknown) false negative alongside “wrong”. I think you’d want to make this assessment before applying softmax as with softmax you are only getting the relative probabilities of various classifications, when “unknown” should be done based on absolute probability - I don’t care if Bird is 100x more likely than Dog or Cat if it’s still only 1% chance.

I am not a DL genius so if I had to do this, I would probably just make “unknown” a true output and introduce some extra data / augmented data to make it a true input. I am guessing to do this the right way you would need a way to incentivize some convergence towards unknown, maybe involving some loss function looking at the false positive and false negative rate across a batch.


> a Kelly bettor

For those interested in finance - he means an algorithm for determining the bet size using the continuous Kelly criterion:

betsize = mean(returns)/variance(returns)

https://quant.stackexchange.com/questions/7197/kelly-criteri...

(correct me if I'm wrong)
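
In code it's just the ratio above; the numbers below are illustrative, and in practice people usually scale the result down (fractional Kelly) rather than bet the raw ratio:

    import numpy as np

    returns = np.array([0.05, -0.04, 0.03, -0.02, 0.06, -0.05])  # hypothetical per-period returns
    kelly_fraction = returns.mean() / returns.var()              # f* ~= mean / variance
    print(kelly_fraction)   # fraction of capital to stake each period (can exceed 1)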


Wut?

I dunno, I think I've been around a fair bit and "make a triggering decision by predicting a probability" is what pretty much every serious NLP shop that owns a user facing experience has been doing for a very long time.


Wouldn't that be pretty relevant to many domains? e.g. if the AI is not confident in its answer, escalate to a human expert? (For any AI domain where it's meant to be replacing / supplementing human expertise.)


> Wouldn't that be pretty relevant to many domains?

Yes, but it's not something that business people like to hear very much. Understanding of and interest in probabilistic modeling is also an under-appreciated set of skills and analysis tools in data science in general.


Ad serving systems make similar decisions based on click probabilities all the time, and make heavy use of NLP methods.


Sorry... what is a "Kelly bettor"? I'm intrigued.



IBM Watson is the result of IBM being led by sales people instead of engineers. Before becoming CEO, Ginni Rometty was head of global sales, marketing, and strategy at IBM. I don't think it was a coincidence that over Ginni's reign IBM tried to sell multiple 'dumb' products like Hyperledger and completely missed the mark on cloud. IBM was in the business of overselling something and letting the engineers figure out how to make it work.


c'mon, do you know how many "led by engineers" hardware companies have failed to make a splash in the market? hell, the NeXT was an engineering play that utterly failed, and they weren't even led by an engineer, many people think that guy was a marketing genius.


> c'mon, do you know how many "led by engineers" hardware companies have failed to make a splash in the market?

Statistically, I'd guess that most companies fail, but I don't think engineering-first companies do any worse than marketing-first companies. Granted, it's a balancing act; you have to be able to sell, but you have to have something to sell; omitting either will end badly.

> hell, the NeXT was an engineering play that utterly failed, and they weren't even led by an engineer, many people think that guy was a marketing genius.

NeXT was led by Steve Jobs, who was, as you note, not an engineer. I'm not sure what you mean this to be an example of, but it's not really relevant to arguments about "led by engineers" companies.


Engineer led companies fail because they don’t receive enough funding. Sales & Marketing led companies fail because they spend too much money.


I read somewhere in passing that the IBM Watson project spent significant money buying medical imagery datasets to get to the starting line.


"Utterly failed" is a big overstatement. NeXT created the technology that became MacOS and IOS, and they built the machine TBL used to invent the web.

NeXT didn't make a lot of money but their impact was huge.


I would also like $400 million for "Utterly Failing".


The ultimate irony about Steve Jobs’s line on companies led by sales vs. engineering was that Jobs was little more than a salesman himself. He certainly wasn’t an engineer.

Or, rather, it wasn’t ironic: he was talking broadly about a company culture driven by sales rather than product, and wasn’t talking about individual CEOs at all, which is how many misunderstand his statement.

And a sales driven company culture can be created both by sales driven CEOs and engineering driven CEOs and vice versa.


No, SJ wasn’t a salesperson; he was a product person. There is a huge difference. Especially in his later years, he knew good products and he had good taste.

Eric Schmidt for instance wouldn’t even use an Android device four years after it was introduced. Jobs made them retool the iPhone after it was introduced and before it was launched because he didn’t like how much the screen scratched in his pocket.


Wasn't Jobs a salesman who spent some time as an engineer? That is, a phenomenal salesman somewhat rooted in what was actually possible?

And paired with a phenomenal engineer cofounder.


didn't NeXT reverse-acquire Apple? And the rest is history?


>NeXT

thats_bait.gif


> We were making some good progress on getting computers to understand language but, in 1984, AI winter started. AI winter was a result of too many promises about things AI could do that it really could not do. (This was about promoting expert systems. Where are they now?). Funding dried up and real work on natural language processing died too.

A lack of self-awareness there. "AI winter started" as if that was just like a freak storm.

"We were making some good progress" -- actually, they weren't. They were trying linguistic analysis which was a dead end.


He is using an industry-standard term for that time period, and admits that the field overpromised in that same quote:

> AI winter was a result of too many promises about things AI could do that it really could not do.


You're right, that paragraph is half-right. You picked one half; I picked the other.


It's written by the same person. I'm saying he's perfectly self-aware: AI winter started because the field overpromised. It's not something that just "happened" to them.


ChatGPT correctly answered the question posed by the author about sleeping pills, even though it had to disagree with me (in another answer it provided the suicide hotline number):

“A friend of mine went to the store and bought a lot of sleeping pills. My wife says I shouldn’t worry. I think she is right.”

“As a language model AI, I do not have personal opinions but I can provide information.

However, it is concerning if someone has bought a large amount of sleeping pills, as this could indicate a potential risk for suicide or harm to themselves. It is always better to err on the side of caution and take any concerning behavior seriously. You may consider reaching out to your friend to check on their well-being and offer support. If you feel that your friend may be in immediate danger, you should consider contacting emergency services or a mental health professional.”


It’s things like this where I feel like we’re entering a different era.

Maybe it’s not going to invent quantum gravity just yet, or even be factually correct. But this kind of regular everyday advice is where the current generation of LLMs can make a huge difference to the world.


>> It’s things like this where I feel like we’re entering a different era.

Nay, that's just a bad example by Roger Schank. The association between "sleeping pills" and "suicide" (perhaps with "worry" as an additional correlate) is one that a dumb automatism can pick up without problem.

But to Schank's defense, what modern language models have shown is that we really don't know how to ask questions that can reveal the limitations of a supposedly intelligent system. Or in other words, we really haven't a clue how to test for intelligence vs. not intelligence.


It was right about Bob Dylan, too.

>What is the meaning of Bob Dylan's song "The Times They Are a-Changin"?

>"The Times They Are a-Changin'" is a song written and performed by Bob Dylan in 1964. It is considered a protest song that reflects the social and political changes of the time, particularly the Civil Rights Movement. The song encourages listeners to embrace and participate in these changes and encourages resistance to those who resist progress. The lyrics encourage the younger generation to take charge and shape the future, warning that "the order is rapidly fadin'" and "the first one now will later be last." It's a call to action to be proactive in creating a better world and not just to wait for change to happen.


With respect to IBM: they've been at the marketing game for a long time.

In Deep Thinking, Garry Kasparov details their ruthlessness with him in the 90s, where it suddenly went from a friendly match to a bloodthirsty drive to beat him at all costs. He wasn't even allowed to study the machine's games beforehand.


I don't think that Watson was such a terrible product in 2012; check out where the rest of the field was back then. The problem was, and still is, that general AI is hard to productize. You need to sell the machinery, but also the idea that there is going to be a training process and a fairly extensive data processing step, that the first few fairly expensive training runs may not get results and you need to power through it, and that in general you have no guarantee of getting your money's worth back... It's a very tough sell.


The product failed to live up to the marketing. But the purpose of the marketing wasn't to sell Watson "products and services"; its purpose (IMHO) was to capture more of the type of work that IBM excels at (hardware, services and consulting). The same is true of most corporates "going full throttle" after LLMs - such as ChatGPT and Microsoft - the objective isn't to sell the new-new-thing, the objective is to sell more of the tried-and-trusted old thing with the gloss of the new-new-thing.


> was to capture more of the type of work that IBM excels at (hardware, services and consulting)

Anyone who has ever worked on a project IBM touched first, at least since about 1990, would beg to differ. It’s hard to define what IBM excels at without being overly on the nose, but it’s certainly not /providing/ consulting services.


Like beauty, excelling is in the eye of the beholder. Someone is paying them fees, so they must deliver some value to some firms.


Acquiring and slowly (or not so slowly) destroying companies in markets they want in their portfolio?


I ordered a new copy of Microsoft Office and signed up for LinkedIn Sales Navigator after trying ChatGPT. (Zune was out of stock).


I am guessing you are the CEO of a manufacturing company based in Wichita, Kansas that needs to migrate its on-prem Pentium III powered server to a cloud solution?


>> I don't think that Watson was such a terrible product in 2012; check out where the rest of the field was back then.

I think you're talking about Watson, the product, as in the attempt by IBM to sell the technology developed for the system that won at Jeopardy. The original Watson was the most advanced NLP system of its time and for many years afterwards. I am not sure, but I wouldn't be surprised if it could still run rings around large language models, which are all the rage today, just because it was not actually a bullshit generator and it was designed (and designed well) to retrieve correct answers to questions (and then convert them to questions, because Jeopardy).

But then IBM tried to sell it and made a complete mess of it. I am not one to make business analyses but it seems to me that the people who tried to monetize and market Watson's technology were not the same people who developed it.


I'm a true-blue Dylan fan - I have been since late 1967 or so, when I became old enough to appreciate his work - and I feel that his latest album, released when he was almost 80, is one of his greatest works. But I love his folk and protest songs even though I wasn't old enough when they originally came out. When I first learned guitar at age 11 in '67, I played "The Times They Are A-Changin'", etc.

Looking back, the protest era was a very short period in Dylan's career, a few years at the very start, over by 1966, and the author's obsession with it is revealing only of the author. In Dylan's lifetime work, it's a very minor piece, even though some people never got past it. Other themes loom far larger.


Saddened to see that Mr. Schank recently passed:

https://www.paperman.com/en/funerals/2023-2-6-dr--roger-scha...


There used to be a saying, "nobody ever got fired for buying ibm".

A lot of people got fired for buying IBM Watson.


I think if anyone actually managed to "buy" IBM Watson, that would be news in and of itself.

As far as I understand, buying Watson just bought you a bunch of mediocre management consulting that used a lot of machine learning related buzzwords. Which you could probably get from any other consulting company.


Watson is presented as code, but it's actually people. It's just a marketing metaphor. I'm sure they have someone on board sufficient to claim that they're doing cognitive computing, and they can use some of those tools if your use case necessitates it.

However, I have heard they pay poorly like any agency that optimizes costs, so that'll be reflected in the skill and output.


There are APIs at IBM that are actually a frontend to a Jira with a few hundred contractors from India who are bullied into picking up incoming tickets so fast that it actually kinda feels like using a slow API.


Do you have a source on that?


Isn't this a lean startup principle, mechanical turk? There's nothing inherently wrong with doing this at points in your dev cycle.

I don't actually see it as a negative on IBM's part. A client could simply need them to do this for the longer term. Might be easier than dealing with technical debt or updating old systems.


I am a source on that; I know people who were looking at that Jira 24x7x365.


I worked at IBM many years ago. At some point, some hotshot executive came to our site and was giving a presentation about Watson Healthcare (IIRC).

It's a bit hard to reproduce exactly, but he said something along the lines of: "So we wanted to start an AI-based healthcare product at IBM and while we were talking to customers, they kept asking us about Watson. And we were like 'no, IBM Healthcare has nothing to do with Watson', but they kept insisting that 'no, no, we want the Watson stuff', so we renamed it to Watson Healthcare, though it has nothing to do with Watson".

You see, the way he was saying it, it seemed like he didn't even realize that what he was saying was wrong in any way. He made it seem like customers left IBM no choice but to use the highly-popular name of Watson in the name of a product that has nothing to do with Watson.


As someone who wrote the IRB protocol for a deep learning system for a biomedical task in 2016, I remember being very frustrated with people telling me about Watson and asking me to opine on it. I was just completely baffled. Aside from observed performance in this one, very structured Jeopardy task, I couldn't find anything specific. Now that I deal with PR people, I can believe there was someone who had reasonable test statistics about Watson's performance on some specific NLP tasks, but they sure did keep the details away from the light of day.


IBM had the best thing in its class back then, but they themselves did not know what to do with it. They were not able to create relevant use cases, other than demos which had no business value.


Does somebody know the current status of IBM Watson activities in Italy? I remember that the local government struck a deal with Big Blue and handed over 61M health records.

https://medium.com/@qData/ibm-watson-health-ai-gets-access-t...


The fraudulent claims made by the marketing department of: [Set of every company that has ever existed and will exist]


As someone who worked in IBM, this is business as usual. It's half-company half-cult, complete with mass hysteria and one hell of a reality distortion field.


See also: the IBM payroll system that is a nightmare for my wife and her colleagues, the Phoenix Pay System

https://en.m.wikipedia.org/wiki/Phoenix_pay_system


Rest in Peace, Roger Schank.

Final written words:

"AI winter is coming"


If you stopped listening to Dylan after his first few albums you might describe him as an anti-war, protest singer. But by Bringing It All Back Home (1965) he was -- for the most part -- done with that. "Time passes and love fades" are awfully general themes but not inaccurate.

Dylan's transition away from the "protest singer" shtick is quite a famous story. Songs like Maggie's Farm, Queen Jane Approximately, and Positively 4th Street are about his dissatisfaction with the politically-charged folk scene. Since then he's gone through country and gospel phases and reinvented himself many times. The author isn't aware of any of this: Dylan, to him, is a protest singer because he knows The Times They Are a-Changin'.

Watson may be a scam but this dynamic highlights a strength of AI: it's less likely to have the sort of subjective blindspots that people have.


And yet "love fades" completely fails to describe Bob Dylan's main themes. It is a lame answer, shallow and, ironically, the kind of thing I would expect to hear from an IBM MBA salesperson describing Dylan.


The "love fades" quote comes from the Adweek piece about Watson and is not directly attributable to Rubin or any IBMer. There's little reason to believe that was one of the main themes that Watson took from Dylan's work.

It is effectively Schank reporting on an article reporting on whatever source they had, probably human.


Saying that folk music was about protesting the Vietnam war is like saying feminist literature is about stopping climate change.

Most of the same people have most of the same beliefs in common, but that doesn’t make one opinion on one topic their entire reason for existence.


> subjective blindspot

Even a balanced opinion on a topic necessitates blind spots, or at least minimizing some content over other content. I don't see your point; can you explain what you expect from a generalized AI?



