What Is ChatGPT Doing ... and Why Does It Work? (Part 10)

The Training of ChatGPT

OK, so we’ve now given an outline of how ChatGPT works once it’s set up. But how did it get set up? How were all those 175 billion weights in its neural net determined? Basically they’re the result of very large-scale training, based on a huge corpus of text—on the web, in books, etc.—written by humans. As we’ve said, even given all that training data, it’s certainly not obvious that a neural net would be able to successfully produce “human-like” text. And, once again, there seem to be detailed pieces of engineering needed to make that happen. But the big surprise—and discovery—of ChatGPT is that it’s possible at all. And that—in effect—a neural net with “just” 175 billion weights can make a “reasonable model” of text humans write.

In modern times, there’s lots of text written by humans that’s out there in digital form. The public web has at least several billion human-written pages, with altogether perhaps a trillion words of text. And if one includes non-public webpages, the numbers might be at least 100 times larger. So far, more than 5 million digitized books have been made available (out of 100 million or so that have ever been published), giving another 100 billion or so words of text. And that’s not even mentioning text derived from speech in videos, etc. (As a personal comparison, my total lifetime output of published material has been a bit under 3 million words, and over the past 30 years I’ve written about 15 million words of email, and altogether typed perhaps 50 million words—and in just the past couple of years I’ve spoken more than 10 million words on livestreams. And, yes, I’ll train a bot from all of that.)

But, OK, given all this data, how does one train a neural net from it? The basic process is very much as we discussed it in the simple examples above. You present a batch of examples, and then you adjust the weights in the network to minimize the error (“loss”) that the network makes on those examples. The main thing that’s expensive about “back propagating” from the error is that each time you do this, every weight in the network will typically change at least a tiny bit, and there are just a lot of weights to deal with. (The actual “back computation” is typically only a small constant factor harder than the forward one.)

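The loop just described (present a batch of examples, measure the loss, backpropagate, and nudge every weight) can be sketched with a toy model. This is a hedged illustration in plain NumPy using a single linear layer, not anything resembling ChatGPT's actual training code; the batch size, learning rate, and target function are all arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network": a single linear layer y = x @ w, trained to fit hidden target weights.
w = rng.normal(size=(4, 1))                       # the weights we will adjust
w_true = np.array([[1.0], [-2.0], [0.5], [3.0]])  # weights generating the targets

def loss_and_grad(w, x, y):
    """Mean-squared error on one batch, plus its gradient w.r.t. every weight."""
    err = x @ w - y
    loss = float(np.mean(err ** 2))
    grad = 2 * x.T @ err / len(x)   # the "back computation": similar cost to the forward one
    return loss, grad

for step in range(500):
    x = rng.normal(size=(32, 4))    # a batch of examples, evaluated in parallel
    y = x @ w_true
    loss, grad = loss_and_grad(w, x, y)
    w -= 0.1 * grad                 # every weight typically changes a little, batch by batch

print(round(loss, 8))  # the loss shrinks toward zero as w approaches w_true
```

Note how the forward results for the whole batch can be computed in parallel, but the update to `w` is applied once per batch, which matches the batch-by-batch constraint described in the next paragraph.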
With modern GPU hardware, it’s straightforward to compute the results from batches of thousands of examples in parallel. But when it comes to actually updating the weights in the neural net, current methods require one to do this basically batch by batch. (And, yes, this is probably where actual brains—with their combined computation and memory elements—have, for now, at least an architectural advantage.)

Even in the seemingly simple cases of learning numerical functions that we discussed earlier, we found we often had to use millions of examples to successfully train a network, at least from scratch. So how many examples does this mean we’ll need in order to train a “human-like language” model? There doesn’t seem to be any fundamental “theoretical” way to know. But in practice ChatGPT was successfully trained on a few hundred billion words of text.

Some of the text it was fed several times, some of it only once. But somehow it “got what it needed” from the text it saw. But given this volume of text to learn from, how large a network should it require to “learn it well”? Again, we don’t yet have a fundamental theoretical way to say. Ultimately—as we’ll discuss further below—there’s presumably a certain “total algorithmic content” to human language and what humans typically say with it. But the next question is how efficient a neural net will be at implementing a model based on that algorithmic content. And again we don’t know—although the success of ChatGPT suggests it’s reasonably efficient.

And in the end we can just note that ChatGPT does what it does using a couple hundred billion weights—comparable in number to the total number of words (or tokens) of training data it’s been given. In some ways it’s perhaps surprising (though empirically observed also in smaller analogs of ChatGPT) that the “size of the network” that seems to work well is so comparable to the “size of the training data”. After all, it’s certainly not that somehow “inside ChatGPT” all that text from the web and books and so on is “directly stored”. Because what’s actually inside ChatGPT are a bunch of numbers—with a bit less than 10 digits of precision—that are some kind of distributed encoding of the aggregate structure of all that text.

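To make the comparison concrete, here is some back-of-the-envelope arithmetic using the figures quoted in the text (roughly 175 billion weights and "a few hundred billion" training tokens; the 300-billion token count below is an assumption for illustration):

```python
# Back-of-the-envelope arithmetic with the figures quoted in the text.
weights = 175e9          # parameters in the network
train_tokens = 300e9     # "a few hundred billion" words/tokens (assumed value)

# Weights per token of training data: close to 1, i.e. little "compression".
print(weights / train_tokens)        # ≈ 0.58

# Stored at, say, 4 bytes per weight, the raw weights alone occupy roughly:
print(weights * 4 / 1e12, "TB")      # ≈ 0.7 TB
```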
Put another way, we might ask what the “effective information content” is of human language and what’s typically said with it. There’s the raw corpus of examples of language. And then there’s the representation in the neural net of ChatGPT. That representation is very likely far from the “algorithmically minimal” representation (as we’ll discuss below). But it’s a representation that’s readily usable by the neural net. And in this representation it seems there’s in the end rather little “compression” of the training data; it seems on average to basically take only a bit less than one neural net weight to carry the “information content” of a word of training data.

When we run ChatGPT to generate text, we’re basically having to use each weight once. So if there are n weights, we’ve got of order n computational steps to do—though in practice many of them can typically be done in parallel in GPUs. But if we need about n words of training data to set up those weights, then from what we’ve said above we can conclude that we’ll need about n² computational steps to do the training of the network—which is why, with current methods, one ends up needing to talk about billion-dollar training efforts.

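The scaling argument can be spelled out numerically: with n weights each used about once per generated token, and about n tokens of training data, one gets roughly n² elementary training steps. The value of n below is an order-of-magnitude stand-in, not an official figure:

```python
n = 2e11                  # order of magnitude: both weight count and training-token count

inference_steps = n       # generating a token uses each weight roughly once
training_steps = n ** 2   # ~n weights adjusted across ~n tokens of training data

print(f"{inference_steps:.0e}")   # 2e+11
print(f"{training_steps:.0e}")    # 4e+22
```

The eleven-orders-of-magnitude gap between the two numbers is the source of the enormous training cost, however generous the hardware throughput.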
Beyond Basic Training

The majority of the effort in training ChatGPT is spent “showing it” large amounts of existing text from the web, books, etc. But it turns out there’s another—apparently rather important—part too.

As soon as it’s finished its “raw training” from the original corpus of text it’s been shown, the neural net inside ChatGPT is ready to start generating its own text, continuing from prompts, etc. But while the results from this may often seem reasonable, they tend—particularly for longer pieces of text—to “wander off” in often rather non-human-like ways. It’s not something one can readily detect, say, by doing traditional statistics on the text. But it’s something that actual humans reading the text easily notice.

And a key idea in the construction of ChatGPT was to have another step after “passively reading” things like the web: to have actual humans actively interact with ChatGPT, see what it produces, and in effect give it feedback on “how to be a good chatbot”. But how can the neural net use that feedback? The first step is just to have humans rate results from the neural net. But then another neural net model is built that attempts to predict those ratings. But now this prediction model can be run—essentially like a loss function—on the original network, in effect allowing that network to be “tuned up” by the human feedback that’s been given. And the results in practice seem to have a big effect on the success of the system in producing “human-like” output.

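The feedback step described above can be caricatured in miniature: collect human ratings of sample outputs, fit a second model to predict those ratings, then use that prediction model like a (negated) loss function to tune the generator. The sketch below shrinks everything to scalars and is only a hedged illustration of the idea, not the actual RLHF pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in "generator": a single number g controlling where its outputs land.
g = 0.0

# Step 1: humans rate sample outputs (here, outputs near 3.0 get the best ratings).
samples = rng.normal(loc=g, scale=2.0, size=200)
human_ratings = -(samples - 3.0) ** 2

# Step 2: fit a tiny "reward model" to predict those ratings (a quadratic fit).
X = np.stack([samples ** 2, samples, np.ones_like(samples)], axis=1)
coeffs, *_ = np.linalg.lstsq(X, human_ratings, rcond=None)

def predicted_reward(x):
    return coeffs[0] * x ** 2 + coeffs[1] * x + coeffs[2]

# Step 3: run the reward model like a (negated) loss function to tune the generator,
# nudging g uphill on the *predicted* rating rather than asking humans again.
for _ in range(200):
    eps = 1e-3
    grad = (predicted_reward(g + eps) - predicted_reward(g - eps)) / (2 * eps)
    g += 0.05 * grad

print(round(g, 3))   # g moves toward 3.0, the behavior humans rated highly
```

The key design point, as in the text, is that after step 2 the expensive human raters are no longer in the loop: the learned predictor supplies the training signal.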
PS: The above is the "second stage" I described in "What Will Learning Become?".

In general, it’s interesting how little “poking” the “originally trained” network seems to need to get it to usefully go in particular directions. One might have thought that to have the network behave as if it’s “learned something new” one would have to go in and run a training algorithm, adjusting weights, and so on.

But that’s not the case. Instead, it seems to be sufficient to basically tell ChatGPT something one time—as part of the prompt you give—and then it can successfully make use of what you told it when it generates text. And once again, the fact that this works is, I think, an important clue in understanding what ChatGPT is “really doing” and how it relates to the structure of human language and thinking.

There’s certainly something rather human-like about it: that at least once it’s had all that pre-training you can tell it something just once and it can “remember it”—at least “long enough” to generate a piece of text using it. So what’s going on in a case like this? It could be that “everything you might tell it is already in there somewhere”—and you’re just leading it to the right spot. But that doesn’t seem plausible. Instead, what seems more likely is that, yes, the elements are already in there, but the specifics are defined by something like a “trajectory between those elements” and that’s what you’re introducing when you tell it something.

And indeed, much like for humans, if you tell it something bizarre and unexpected that completely doesn’t fit into the framework it knows, it doesn’t seem like it’ll successfully be able to “integrate” this. It can “integrate” it only if it’s basically riding in a fairly simple way on top of the framework it already has.

It’s also worth pointing out again that there are inevitably “algorithmic limits” to what the neural net can “pick up”. Tell it “shallow” rules of the form “this goes to that”, etc., and the neural net will most likely be able to represent and reproduce these just fine—and indeed what it “already knows” from language will give it an immediate pattern to follow. But try to give it rules for an actual “deep” computation that involves many potentially computationally irreducible steps and it just won’t work. (Remember that at each step it’s always just “feeding data forward” in its network, never looping except by virtue of generating new tokens.)

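The "never looping except by generating new tokens" point can be made concrete: a feed-forward pass composes a fixed number of steps, so a computation requiring more sequential steps can only be reached by running the pass repeatedly. A toy illustration, with a simple modular map standing in for a layer's worth of computation:

```python
def forward_pass(x, depth=3):
    """A fixed-depth 'network': always exactly `depth` sequential steps."""
    for _ in range(depth):
        x = (3 * x + 1) % 17   # stand-in for one layer's worth of computation
    return x

def iterate(x, k):
    """A genuinely iterative computation: k applications of the same map."""
    for _ in range(k):
        x = (3 * x + 1) % 17
    return x

# A single forward pass can realize only 3 steps of the map.
print(forward_pass(5) == iterate(5, 3))   # True

# Matching k = 12 steps takes 4 whole passes, analogous to a transformer
# "looping" only by emitting tokens and running the network again.
x = 5
for _ in range(4):
    x = forward_pass(x)
print(x == iterate(5, 12))                # True
```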
Of course, the network can learn the answer to specific “irreducible” computations. But as soon as there are combinatorial numbers of possibilities, no such “table-lookup-style” approach will work. And so, yes, just like humans, it’s time then for neural nets to “reach out” and use actual computational tools. (And, yes, Wolfram|Alpha and Wolfram Language are uniquely suitable, because they’ve been built to “talk about things in the world”, just like the language-model neural nets.)


This article is an analysis of how ChatGPT works that I came across online. The author explains step by step how ChatGPT operates, without going into the concrete implementation of the models and algorithms, which makes it suitable reading for developers outside machine learning.

The author, Stephen Wolfram, is a well-known scientist, entrepreneur, and computer scientist. He is the founder and CEO of Wolfram Research, the company behind many software products and technologies, including Mathematica and the Wolfram|Alpha computational knowledge engine.

The text was first translated with ChatGPT and then revised by me; the passages in red are notes I have added. Since the original is quite long, I have split it into multiple posts by section. If you would like to read the original, follow the link below.

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/

