I See Dumb People
This is a rant about cybersecurity and the information space around AI.
A couple of years ago, I was out sailing in NY harbor. A gust picked up my red L.L. Bean baseball cap and blew it into the water. When I got home and was scrolling Facebook, the first thing I saw was an ad for a red L.L. Bean baseball cap.
So what are the odds that Facebook really is monitoring everything you do and serving you ads accordingly?
I’ve seen Mark Zuckerberg talk over the years and say things like, “They trust me. Dumb fucks.”
All too plausible that he would try things like that.
But I also know how machine learning works. The algorithm knows I sometimes shop at L.L. Bean, and it certainly knows I looked at the red baseball cap, even if I didn’t buy it.
And it knows my location. It’s at least equally plausible that machine learning would learn that after a day of outdoor activities, people sometimes buy baseball caps or outdoor wear, and that it would show me something I had already shown interest in.
For sure there are some shenanigans that we don’t know about. But the known ways they monitor and manipulate us are as bad as the conspiracy theories. Who needs to spy on you when you tell them where you are, who you’re with, and every online interaction you have?
Eventually I quit Facebook:
- I had a possibly ADD-related unhealthy relationship with it. I would spend too much time there…even though most of that time wasn’t particularly positive. There was a lot of addictive, clickbait-y junk and stupid stuff, and people getting in my business. And politics.
- I felt there was more upside to leaving in terms of overall mental well-being.
- Overall, not trusting the company. Google and Apple know everything about me. I don’t need every company to know everything, especially ones who have repeatedly shown themselves to be manipulative and untrustworthy.
Now I want to share a tweet that went around, in which someone claims ChatGPT told them it had been trained on their company’s private board minutes.
Okay, not gonna lie, this is stupid.
ChatGPT explicitly provides an opt-out from training on your data: in your personal settings, under Data Controls, you can turn model training off. The language in contracts for products like ChatGPT Enterprise is also explicit.
A lot of work goes into making a clean data corpus for pre-training. If you do a multi-million-dollar training run, and then realize the data set was contaminated with really bad stuff, there’s no way to fix it. You have to start over from scratch, or blacklist some questions because responses expose you to liability. It’s just a disaster and a PR nightmare.
LLMs struggle to acquire new knowledge through post-training and fine-tuning. Reinforcement Learning from Human Feedback, Direct Preference Optimization, and Supervised Fine-Tuning are mostly about structuring better responses out of the existing knowledge base.
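To make that concrete, here is a minimal sketch of the DPO objective (assuming PyTorch; the function and variable names are mine). The loss is defined entirely over log-probabilities of candidate responses the model already generates: it re-ranks answers, which makes it a good tool for shaping behavior and a poor vehicle for injecting your board minutes into the weights.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Simplified DPO objective (Rafailov et al., 2023).

    Each input is the summed log-probability that the trainable policy
    (or a frozen reference model) assigns to a human-preferred
    ("chosen") or dispreferred ("rejected") response. The loss widens
    the margin between them relative to the reference model -- no new
    facts enter the weights through this objective.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```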
It’s beyond unlikely that OpenAI is intentionally, systematically taking your private conversations and training on them. If you have any picture of how the LLM sausage is made, the scenario does not seem remotely plausible. Of course it’s hallucinating when it says it read the board minutes.
Nevertheless, I’ve repeatedly had this conversation with, you know, smart IT people who are supposed to be familiar with the technology and the ToS. And sometimes people have gotten a little testy with me for telling them their concerns are over the top. “I heard it from X! Trust me, bro.”
I’m not saying there aren’t reasons to avoid ChatGPT. So how should AI engineers and CTOs and CEOs think about the risks of putting private, proprietary data into LLMs?
There are levels to this game.
Level 0: What, me worry? Let anybody put anything they want into any SaaS product, install any mobile app, run any Chrome extension that can read and change your Web pages.
This is clearly a very bad idea. Look at 23andMe. They had a severe breach, then they went broke, and they could have been bought by anybody, who would then have had all your DNA information without being bound by the contracts that were previously signed.
And frankly this seems like a blind spot in our existing privacy and security landscape. Some rights you shouldn’t be able to sign away, or lose in bankruptcy court.
If you are Goldman Sachs, and Google knows everything you search for and Bloomberg has all your chats with your clients, they will know what you are up to. Other providers know more about you than you might be comfortable with: NetJets, your ISP, your mobile provider, whoever runs all the networks of video cameras, your personal assistants, butlers, and sysadmins, and so on. Goldman basically started a company that is “Signal with compliance” out of discomfort with Bloomberg. And one huge hedge fund is basically “what if Google used every bit of online data exhaust for investing.” It’s the Wild West out there. Software is eating the world, so everything gets captured, and now AI can find anything. You have to do better than “nothing.”
Level 1: Understand terms of service, use only trusted SaaS endpoints, exercise opt-outs, and implement internal policies and controls about what can be pasted into approved AI tools.
That seems like the bare minimum. Then, as long as you trust that ChatGPT complies properly with the opt-out and the terms of service, and they don’t suffer a breach, you should be okay.
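As for controls on what can be pasted, here is a toy sketch of the idea. A real deployment would use a proper DLP product; every pattern and name below is illustrative only.

```python
import re

# Illustrative patterns only -- tune to what would actually hurt you.
BLOCKED_PATTERNS = {
    "US SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "AWS access key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "internal codename": re.compile(r"\bproject[_ ]redwood\b", re.IGNORECASE),
}

def screen_prompt(prompt: str) -> list[str]:
    """Return the names of any sensitive patterns found in an outbound prompt."""
    return [name for name, pattern in BLOCKED_PATTERNS.items()
            if pattern.search(prompt)]

prompt = "Summarize the Q3 numbers for Project Redwood: ..."
hits = screen_prompt(prompt)
print(f"Blocked before leaving the building: {hits}" if hits
      else "OK to send to the approved endpoint.")
```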
Should you trust OpenAI? They are a startup. Things are a little bit chaotic. Nation-states spy on them. And they have had breaches where your conversations were shown to other users and where info was stolen.
Also, they are currently under a court order to retain logs indefinitely and let plaintiffs search them. Presumably that’s a separate pipeline that doesn’t touch, e.g., training and other operational pipelines. But again, breaches happen.
So you need to exercise your own judgment about which SaaS services you can trust and what the upsides and the downsides are.
Level 2: Trusted public cloud. If you are a Fortune 500 company or a regulated investment company, or you handle medical or other data subject to HIPAA and similar regulations, you need stronger representations about the privacy of the data. You might avoid public API endpoints and instead use a cloud provider like Microsoft’s Azure, which is audited and certified against standards like ISO 27001, SOC 3, and HIPAA. Then you access hosted LLMs via that trusted public cloud.
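In practice, “access hosted LLMs via the trusted cloud” mostly means pointing your client at your own provisioned endpoint instead of the public API. A minimal sketch, assuming the openai Python SDK (v1.x) and an Azure OpenAI resource you have already set up; the endpoint, deployment name, and API version are placeholders:

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # your Azure resource
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # placeholder; pin whatever version you've validated
)

# Same chat API shape as the public endpoint, but requests stay under the
# Azure tenancy and data-handling terms you already accepted for Office 365.
response = client.chat.completions.create(
    model="my-gpt4o-deployment",  # the name of *your* deployment, not a public model
    messages=[{"role": "user", "content": "Summarize this internal memo: ..."}],
)
print(response.choices[0].message.content)
```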
If you use Microsoft Teams and hosted Microsoft Office 365, then you have chosen to trust Microsoft with your internal communications. There is no reason not to trust them for LLMs.
For most companies, major cloud providers are the best they can do. But if you are the CIA, or a really secretive investment firm like Renaissance, you are not even OK with public cloud.
Level 3: On-prem servers and trusted private clouds. You host your own email, and chat, and video-conferencing, and your own LLMs. No Teams, no Zoom, no Gmail. You implement things like biometric access and compartmentalization.
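Self-hosting the LLM piece is more tractable than it sounds: open-weights models can be served behind an OpenAI-compatible API (vLLM and Ollama both do this out of the box), so your application code barely changes. A sketch, with a placeholder internal hostname and model name:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://llm.internal.example.com:8000/v1",  # your on-prem server
    api_key="unused-on-a-private-network",               # or your internal auth token
)

# Same client code as Levels 1 and 2, but the prompt never leaves your perimeter.
response = client.chat.completions.create(
    model="llama-3.1-70b-instruct",  # whichever open-weights model you deploy
    messages=[{"role": "user", "content": "Hello from inside the perimeter."}],
)
print(response.choices[0].message.content)
```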
But if you build everything in-house, it’s expensive. You still have to trust someone: your employees and the vendors who help you build it. You have to constantly monitor and update it. And you have to trust that you can do it more securely than Microsoft and AWS and Google, for whom it is not their first rodeo. If you’re not extremely capable, you’re just gonna spend a lot more money, create a lot of hassle, and end up less secure.
If government-sponsored hackers want to breach you, they still have a good shot: compromising workers, getting their spies hired by you, intercepting equipment and tampering with it. With enough effort, they can probably get into anything. If Microsoft fails to stop them, again and again, what chance have you got?
As an IT manager or CTO or CEO, you have to pick a posture based on your company’s sensitivity to a hack or loss of data, and based on your resources. For most companies, keeping everything in a single trusted cloud is the most appropriate solution.
So choose your own adventure. Pick your poison. How dead are you if you have a breach, and what can you afford to prevent it? Understand your threat models and what will kill you, like ransomware and spearphishers wiring firm or customer funds to a Manila casino. Make sure you are implementing best practices, spending appropriate resources combating appropriate realistic well-grounded threats, and not chasing your tail and making operations more complicated for no reason.
All this to say: there are valid reasons why you might not want to use OpenAI, but them spying on and publishing your private conversations is not one of them. (Unless you somehow leak them onto the Web and into the training corpus.)
And you should have people who know what they are doing make these decisions, not base them on tweets or panel discussions at your typical industry conference.
So what’s my point? Mostly I just wanted to get this off my chest, and the next time it comes up, point to it.
But also, there’s a lot of nonsense out there: bad medical information, bad AI information, bad cybersecurity information. It’s a free country; anybody can be a viral influencer on social media.
There are a lot of professional AI-haters, like Gary Marcus and Ed Zitron, for whom AI can do no right. And even the so-called experts don’t always do the community a service with what they say.
And of course, there’s an insane amount of hype about AI and people who sell it act like it’s better than it is, and will be AGI if you just give them seven trillion dollars, man.
AI figured out language. Which is a huge step forward. But it has a long way to go to be anything like AGI. It has to be able to learn on the fly. It has to integrate the left brain and right brain, Type 1 and Type 2 cognition. But just with deep understanding of language, we can automate a lot of things, and it will disrupt a lot of professions and industries.
No one is coming to save you. Our leaders don’t know what they’re doing. There is no Real Room where real experts with perfect information and real authority are in charge. It’s just us. We are The Room. There’s no replacement for serious thinking grounded in reality, and it’s in short supply.
So prep for artificial intelligence, and also inconceivable amounts of human stupidity. Be careful out there.
(Obligatory disclosure: on some days, and on some matters, I am without doubt one of the Dumb People. And having given this post a clickbait title, there will without a doubt be a stupid typo in here.)