Fifty years ago, Intel co-founder Gordon Moore proposed Moore's law: the number of transistors that can fit on an integrated circuit doubles roughly every 24 months. Over the past two years, however, debate has continued over whether Moore's law has run its course.
Hand in hand with Moore's law came Dennard scaling: as transistor density rises, the power consumed by each transistor falls, so the power per square millimeter of silicon stays roughly constant. But Dennard scaling began to slow markedly in 2007 and had all but ended by 2012.
In other words, new generations of semiconductor technology can no longer deliver leaps in performance, and even multicore designs have not significantly improved energy efficiency. Given that, can hardware be used more efficiently? And how will the semiconductor industry evolve from here?
Addressing these questions at the 2021 T-EDGE global innovation conference, co-hosted by TMTPost (鈦媒體) and the National New Media Industry Base, John Hennessy — chair of the board of Alphabet (Google's parent company), winner of the 2017 Turing Award, and former president of Stanford University — delivered a speech titled "Trends and Challenges in Deep Learning and Semiconductor Technologies."
In his view, further performance gains will require new architectural approaches that use integrated circuits far more efficiently. He sees three possible directions:
1. Software-centric mechanisms: improving the efficiency of software so that it makes better use of the hardware;
2. Hardware-centric approaches: what he calls domain-specific architectures, or domain-specific accelerators;
3. Combinations of the two: developing languages matched to these specialized architectures so that applications can be built for them effectively.
Amid these changes, Hennessy argues, "general-purpose processors will no longer be the main engine driving the industry; domain-specific processors that co-evolve tightly with software will gradually come to play a major role. We may therefore see a more vertical industry — closer vertical integration between the developers of deep learning and machine learning models and the developers of operating systems and compilers, so that their programs can run efficiently, train efficiently, and move into real-world use."
What follows is a transcript of John Hennessy's speech, edited by TMTPost:
Hello I"m John Hennessy, professor of computer science and electrical engineering at Stanford University, and co-winner of the Turing Award in 2017.
大家好,我是約翰·軒尼詩,斯坦福大學(xué)計算機(jī)科學(xué)與電氣工程教授,也是2017 年圖靈獎共同獲得者。
It"s my pleasure to participate in the 2021 T-EDGE conference.
很高興能參加 2021年的 T-EDGE 大會。
Today I"m going to talk to you about the trends and challenges in deep learning and semiconductor technologies, and how these two technologies want a critical building block for computing and the other incredible new breakthroughs in how we use computers are interacting, conflicting and how they might go forward.
今天我想談?wù)勆疃葘W(xué)習(xí)和半導(dǎo)體技術(shù)領(lǐng)域的趨勢和挑戰(zhàn)、這兩種技術(shù)需要的關(guān)鍵突破、以及計算機(jī)領(lǐng)域的其他重大突破和發(fā)展方向。
AI has been around for roughly 60 years, and for many years it continued to make progress, but at a slow rate — much slower than many of the early prophets of AI had predicted.
Then there was a dramatic breakthrough around deep learning. There were several smaller examples, but certainly AlphaGo defeating the world Go champion, at least ten years before it was expected, was a dramatic breakthrough. It relied on deep learning technologies, and it exhibited what even professional Go players called creative play.
That was the beginning of a world change.
Today we"ve seen many other deep learning breakthroughs where deep learning is being used for complex problems, obviously crucial for image recognition which enables self-driving cars, becoming more and more useful in medical diagnosis, for example, looking at images of skin to tell whether or not a lesion is cancerous or not, and applications in natural language particularly around machine translation.
今天,深度學(xué)習(xí)也在其他領(lǐng)域取得重大突破,被應(yīng)用于解決復(fù)雜的問題。其中最明顯的自然是圖像識別技術(shù),它讓自動駕駛技術(shù)成為可能。圖像識別技術(shù)在醫(yī)學(xué)診斷中也變得越來越有用,可通過查看皮膚圖像判斷是否存在癌變。除此之外,還有在自然語言處理中的應(yīng)用,尤其是在機(jī)器翻譯方面頗具成果。
For Latin-based languages, machine translation is now basically as good as professional translators, and it is constantly improving for Chinese to English — a much more challenging translation problem — where we are also seeing significant progress.
Most recently we"ve seen AlphaFold 2, a deep minds approach to using deep learning for protein folding, which advanced the field by at least a decade in terms of what is doable in terms of applying this technology to biology and going to dramatically change the way we make new drug discovery in the future.
近期我們也有 AlphaFold 2,一種使用深度學(xué)習(xí)進(jìn)行蛋白質(zhì)結(jié)構(gòu)預(yù)測的應(yīng)用,它將深度學(xué)習(xí)與生物學(xué)進(jìn)行結(jié)合,讓該類型的應(yīng)用進(jìn)步了至少十年,將極大程度地改變藥物研發(fā)的方式。
What drove this incredible breakthrough in deep learning? Clearly the underlying concepts had been around for a while — in fact, in many cases they had been discarded earlier.
So why was it possible to make the breakthrough now?
First of all, we had massive amounts of data for training. The Internet is a treasure trove of data. ImageNet, for example, was a critical tool for training image recognition: today ImageNet holds close to 100,000 object categories, with more than 1,000 images per category — enough to train image recognition systems really well. That was key.
Obviously we use lots of other data as well: whether for protein folding, medical diagnosis, or natural language, we rely on data available on the Internet that has been accurately labeled for training.
Second, we were able to marshal massive computational resources, primarily through large data centers and cloud computing. Training takes hours and hours on thousands of specialized processors — a capability we simply didn't have earlier. That was crucial to solving the training problem.
I want to emphasize that training is the computationally intensive problem here; inference is much simpler by comparison. Here you can see the growth in the compute, measured in petaflop/s-days, needed to train a series of models. Training AlphaZero, for example, required about 1,000 petaflop/s-days — roughly a week on the largest computers available in the world.
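To unpack that unit (a rough sanity check of my own, assuming a machine that sustains on the order of 150 petaflop/s, about the fastest systems of that period):

$$ \frac{1000\ \text{petaflop/s}\cdot\text{days}}{150\ \text{petaflop/s}} \approx 6.7\ \text{days} \approx \text{one week} $$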
This demand has actually been growing faster than Moore's law — faster than semiconductors improved even in their very best era. We've seen a 300,000-fold increase in compute going from training simple models like AlexNet up to AlphaGo Zero, and newer models like GPT-3 have billions of parameters to set, so the amount of data the training has to process is truly massive. That's where the real challenge comes from.
Moore"s law, the version that Gordon Moore gave in 1975, predicted that semiconductor density would continue to grow quickly and basically double every two years but we began to diverge from that. Really quickly diverge began in around 2000 and then the spread is growing even wider. As Gordon said in the 50th anniversary of the first prediction: no exponential is forever. Moore"s law is not a theorem or something that"s definitely must hold true. It"s an ambition which the industry was able to focus on and keeping tag. If you look at this curve, you"ll notice that for roughly 50 years we drop only a factor of 15 while gaining a factor of more than almost 10,000.
摩爾定律,即戈登摩爾在 1975 年給出的版本,預(yù)測半導(dǎo)體密度將繼續(xù)快速增長,基本上每兩年翻一番,但我們開始偏離這一增長速度。偏離在2000 年左右出現(xiàn),并逐步擴(kuò)大。戈登在預(yù)測后的五十年后曾說道:沒有任何的物理事物可以持續(xù)成倍改變。當(dāng)然,摩爾定律不是定理或必須成立的真理,它是半導(dǎo)體行業(yè)的一個目標(biāo)。仔細(xì)觀察這條曲線,你會注意到在大約 50 年中,我們僅偏離了約 15 倍,但總共增長了近 10,000 倍。
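In formula form, the 1975 prediction says the transistor count $N$ in year $t$ grows from a baseline $N_0$ in year $t_0$ as

$$ N(t) = N_0 \cdot 2^{(t - t_0)/2}, $$

a doubling every two years — roughly a factor of 32 per decade.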
So we"ve largely been able to keep on this curve but we began diverging and when you factor in increasing cost of new fab and new technologies and you see this curve when it"s converted to price per transistor not dropping nearly as fast as it once fell.
所以我們基本上能夠維持在這條曲線上,但我們確實開始跟不上了。如果你考慮到新晶圓廠和新技術(shù)的成本增加,當(dāng)它轉(zhuǎn)換為每個晶體管的價格時,你會看到這條曲線的下降速度不像曾經(jīng)下降的那么快。
We also face another problem: the end of so-called Dennard scaling. Dennard scaling is an observation by Robert Dennard, the inventor of the DRAM that is ubiquitous in computing. He observed that as transistor dimensions shrank, the voltage and current would shrink as well, with the result that the power per square millimeter of silicon stayed nearly constant. Because the number of transistors per square millimeter was going up dramatically from one generation to the next, the power per computation was actually dropping quite quickly. That came to a halt around 2007: the red curve, which had been rising slowly between 2000 and 2007, really began to take off. Power became the key issue, and figuring out how to achieve energy efficiency would become more and more important as these technologies moved forward.
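Stated quantitatively (the standard form of Dennard's 1974 observation, spelled out here for clarity): scale all linear dimensions and the supply voltage by $1/k$, and transistors switch faster by a factor of $k$. The switching power per transistor, $P = C V^2 f$, then scales as

$$ P' = \frac{C}{k} \cdot \left(\frac{V}{k}\right)^{2} \cdot (kf) = \frac{C V^2 f}{k^2}, $$

while the number of transistors per unit area grows by $k^2$, so power per square millimeter stays constant. Once supply voltage stopped scaling — leakage puts a practical floor near one volt — the $1/k^2$ term disappeared and power density began to climb: that is the red curve Hennessy describes.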
The combined result is a leveling-off of uniprocessor (single-core) performance: after growing roughly 25% a year in the industry's early period, then more than 50% a year in a remarkable period driven by the introduction of RISC technologies and instruction-level parallelism, growth slowed, and the focus shifted to multicore designs built on those technologies.
In the last two years we've seen less than 5% improvement in performance per year, and even multicore designs, with the inefficiencies they bring, don't significantly change that picture.
Indeed, we are in the era of dark silicon, where multicore chips often slow down or shut off cores to prevent overheating — and that overheating comes from power consumption.
So what are we going to do? We're in a dilemma. We have a new technology, deep learning, which seems able to solve problems we never thought we could handle effectively, but it requires massive amounts of computing power to move forward. At the same time, the slowing of Moore's law and the end of Dennard scaling are squeezing the industry's ability to do what it relied on for many years: simply deliver the next generation of semiconductor technology and have everything get faster.
So we have to think about new solutions. There are three possible directions.
Software-centric mechanisms: improving the efficiency of our software so that it makes more efficient use of the hardware. Consider in particular the move to scripting languages such as Python — dynamically typed languages that make programming very easy but, as you'll see in a moment, are not terribly efficient.
Hardware-centric approaches: can we change the way we architect these machines to make them much more efficient? This approach is called domain-specific architectures, or domain-specific accelerators. The idea is to do just a few tasks, but to tune the hardware to do those tasks extremely well. We've already seen examples of this — graphics processors, for instance, or the modem inside your cell phone. Those are special-purpose architectures that use intensive computational techniques but are not general-purpose: they aren't programmable for arbitrary tasks, only for the range of graphics operations or the operations a modem requires.
And then, of course, some combination of these: can we come up with languages that match these new domain-specific architectures? Domain-specific languages that improve efficiency and let us code a range of applications very effectively.
This is a fascinating slide from a paper by Charles Leiserson and his colleagues at MIT, published in Science, called "There's Plenty of Room at the Top."
Their observation is that software inefficiency, and the inefficiency of matching software to hardware, leaves a lot of room to improve performance. They took an admittedly very simple program — matrix multiply — written initially in Python, and ran it on an 18-core Intel processor. Simply by rewriting the code from Python to C they gained a factor of 47. Introducing parallel loops gave them another factor of roughly eight.
Introducing memory optimizations — if you're familiar with large-scale matrix multiply, doing it in a blocked fashion dramatically improves how effectively the cache is used — gave them another factor of a little under 20, about 15. Finally, using the vector instructions inside the Intel processor gained another factor of 10. Overall, the final program runs more than 62,000 times faster than the initial Python program.
This is not to say you would see the same gains in larger programs or in every environment, but it shows how much inefficiency there is in at least one simple application. Of course, not many performance-sensitive programs are written in Python, but even the improvement from plain C to the fully parallel C version using SIMD instructions is similar to what a domain-specific processor buys you — significant in its own right: a factor of nearly 100, more than 100, almost 150.
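A minimal sketch of the progression the paper describes (illustrative code, not Leiserson et al.'s actual benchmark; the tile size and OpenMP pragmas are my assumptions):

```c
#include <stddef.h>

#define N  4096
#define BS 64   /* tile size, chosen so three tiles fit in cache; N % BS == 0 */

/* Naive version: the direct triple loop, the C analogue of the Python code. */
void matmul_naive(const double *A, const double *B, double *C) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

/* Optimized version, one change per step in the paper's progression:
 * parallel outer loops ("parallel loops"), blocked iteration for cache
 * reuse ("memory optimization"), and a unit-stride inner loop the
 * compiler can vectorize ("SIMD").  Caller must zero C first. */
void matmul_opt(const double *A, const double *B, double *C) {
    #pragma omp parallel for collapse(2)
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t jj = 0; jj < N; jj += BS)
            for (size_t kk = 0; kk < N; kk += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++) {
                        double a = A[i * N + k];
                        #pragma omp simd
                        for (size_t j = jj; j < jj + BS; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

Compiled with something like `cc -O3 -fopenmp -march=native`, both functions compute the same product; the second simply keeps each tile resident in cache while it is reused and exposes unit-stride loops the compiler can turn into vector instructions.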
So there"s lots of opportunities here and that"s the key point behind us slide of an observation.
所以提升空間是很多的,這個研究的發(fā)現(xiàn)就是如此。
So what are these domain-specific architectures? They are architectures that achieve higher efficiency by building the characteristics of the domain into the architecture.
We"re not trying to do just one application but we"re trying to do a domain of applications like deep learning for example like computer graphics like virtual reality applications. So it"s different from a strict ASIC that is designed to only one function like a modem for example.
我們在做的不只是一個應(yīng)用程序,而是在嘗試做一個應(yīng)用程序領(lǐng)域,比如深度學(xué)習(xí),例如像虛擬現(xiàn)實、圖形處理。因此,它不同于ASIC,后者設(shè)計僅具有一個功能,就例如調(diào)制解調(diào)器。
It requires more domain-specific knowledge, so we need a language that conveys important properties of the application which are hard to deduce when starting from a low-level language like C. This is a product of codesign: we design the applications and the domain-specific processor together, and that is critical to making them work well together.
Notice that these are not processors on which we run general-purpose applications. The intention is not to take arbitrary C code and move it over; the intention is to take an application designed to run on that particular DSA, and to use a domain-specific language to convey from the application to the processor the information it needs to deliver significant performance improvements.
The key goal is higher efficiency in the use of both power and transistors. Remember, those are the two limiters: the slowing growth in transistor counts, and the power problem left by the end of Dennard scaling. So we are trying to improve efficiency on both fronts.
The good news? Deep learning is a broadly applicable technology. It is a new programming model: programming with data rather than writing massive amounts of highly specialized code — you use data to train a deep learning model to detect the specialized circumstances in the data.
So we have a good target domain: applications that genuinely demand massive performance increases, and for which we believe appropriate domain-specific architectures exist.
It"s important to understand why these domain specific architectures can win in particular there"s no magic here.
我們需要弄明白這些特定領(lǐng)域架構(gòu)的優(yōu)勢。
People familiar with the books Dave Patterson and I co-authored know that we believe in quantitative analysis — an engineering-science approach to designing computers. So what makes these domain-specific architectures more efficient?
First of all, they use a simple model of parallelism that works across a specific domain, which means they can get by with less control hardware. For example, we switch from the multiple-instruction, multiple-data (MIMD) model of a multicore to a single-instruction, multiple-data (SIMD) model. That dramatically reduces the energy spent fetching instructions, because we now fetch one instruction rather than many.
We also move to VLIW rather than speculative out-of-order mechanisms: in domains where the code can be analyzed well enough that the dependences are known, parallelism can be created and structured at compile time rather than discovered dynamically at runtime.
Second, we make more effective use of memory bandwidth by going to user-controlled memory systems rather than caches. Caches are great — except when large amounts of data are streaming through them; then they are extremely inefficient, because that is not what they were meant to do. Caches are meant to work when a program does repetitive things in a somewhat unpredictable fashion. Here we have repetitive things in a very predictable fashion, but on very large amounts of data.
So we take an alternative approach: we use prefetching and other techniques to move data into the memory inside the domain-specific processor, and once it is there we make heavy use of it before moving it back to main memory.
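A sketch of that pattern (the `dma_start`/`dma_wait` hooks below are hypothetical stand-ins for an accelerator's asynchronous copy engine, implemented synchronously here so the example is self-contained — an illustration, not any real DSA's API):

```c
#include <stddef.h>
#include <string.h>

#define TILE 4096
static float scratch[2][TILE];   /* two software-managed scratchpad tiles */

/* Hypothetical DMA hooks: real hardware would copy asynchronously. */
static void dma_start(float *dst, const float *src, size_t n) {
    memcpy(dst, src, n * sizeof(float));
}
static void dma_wait(const float *buf) { (void)buf; /* no-op here */ }

/* Stream a large array through the scratchpad, double-buffered:
 * while the compute unit works on tile t, the DMA engine fills t+1,
 * so data is used heavily on-chip before going back to DRAM. */
float process(const float *src, size_t ntiles) {
    float acc = 0.0f;
    dma_start(scratch[0], src, TILE);                     /* prefetch tile 0 */
    for (size_t t = 0; t < ntiles; t++) {
        dma_wait(scratch[t & 1]);                         /* tile t is ready */
        if (t + 1 < ntiles)                               /* prefetch next */
            dma_start(scratch[(t + 1) & 1], src + (t + 1) * TILE, TILE);
        for (size_t i = 0; i < TILE; i++)                 /* heavy reuse */
            acc += scratch[t & 1][i] * scratch[t & 1][i];
    }
    return acc;
}
```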
We eliminate unneeded accuracy. It turns out we need much less numerical precision than in general-purpose computing: 8- to 16-bit integers instead of wider ones, and 16- to 32-bit floating point rather than large 64-bit floating-point numbers. We gain efficiency by making the data items smaller and the arithmetic operations cheaper.
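To make the precision point concrete, here is one common way such hardware trades accuracy for efficiency (a generic symmetric int8 scheme — my illustration, not anything TPU-specific):

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Symmetric int8 quantization: map a float tensor onto [-127, 127]
 * with one shared scale factor, as inference accelerators commonly do.
 * An 8-bit multiply-accumulate costs far less energy and silicon area
 * than a 64-bit floating-point one. */
float quantize_int8(const float *x, int8_t *q, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++)
        max_abs = fmaxf(max_abs, fabsf(x[i]));
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (size_t i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(x[i] / scale);   /* round to nearest */
    return scale;   /* keep the scale to dequantize results later */
}
```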
The key is that the domain-specific programming model matches the application to the processor. These are not general-purpose processors: you are not going to take a piece of C code, throw it on one of these processors, and be happy with the result. They are designed to match a particular class of applications, and that structure is determined by the interface of the domain-specific language and the underlying architecture.
Here is an example to give you an idea of how differently we use silicon in these environments than in a traditional processor.
What I"ve done here is taken a first generation TPU-1 the first tensor processing unit from Google but I could take the second or third or fourth the numbers would be very similar. I show you what it looks like it"s a block diagram in terms of what the chip area devoted to. There"s a very large matrix multiply unit that can do a two 56 x 2 56 x 8 bit multiplies and the later ones actually have floating point versions of that multiplying. It has a unified buffer used for local activations of memory buffer, interfaces accumulators, a little bit of controls and interfaces to DRAM.
這里展示是谷歌的第一代 TPU-1 ,當(dāng)然我也可以采用第二、第三或第四代,但是它們帶來的結(jié)果是非常相似的。這些看起來像格子一樣的圖就是芯片各區(qū)域的分工。它有一個非常大的矩陣乘法單元,可以執(zhí)行兩個 56 x 2 56 x 8 位乘法,后者實具有浮點版本乘法。它有一個統(tǒng)一的緩沖區(qū),用于本地內(nèi)存激活。還有接口、累加器、DRAM。
Today those would be high-bandwidth DRAMs; early on it was DDR3. Looking at how the area is used: 44% is memory, storing temporary results, weights, and things being computed; almost 40% is compute; 15% is interfaces; and 2% is control.
Compare that with a single Skylake core from an Intel processor. There, 33% of the area goes to cache — so notice the TPU devotes more of its area to memory than the Skylake core does. In fact, if you exclude the cache tags — they are overhead, not real data — the gap widens: the Skylake figure drops to about 30%, leaving the TPU with almost 50% more area for active data.
30% of the Skylake core's area goes to control. That's because, like most modern general-purpose processors, it is an out-of-order, dynamically scheduled processor, which requires significantly more area for control — roughly 15 times more. That control is pure overhead, and unfortunately it is also energy-intensive, so it's a big power consumer. 21% of the area goes to compute.
So notice the big advantage here: the TPU's compute area is almost double that of the Skylake core. Then there is memory-management overhead and, finally, miscellaneous overhead. The Skylake core spends a lot more on control, a lot less on compute, and somewhat less on memory.
So where does this leave us? We've come to an interesting time in the computing industry, and I want to conclude by reflecting on how things are likely to move forward, because I think we're at a real turning point in the history of computing.
From the 1960s — the introduction of the first real commercial computers — to 1980, we had largely vertically integrated companies.
IBM, Burroughs, and Honeywell were among the early companies, emerging in the wake of the work at the University of Pennsylvania that built ENIAC, the first electronic computer.
IBM is the perfect example of a vertically integrated company in that period. They did everything: they built their own chips, they built their own disks — in fact, IBM's West Coast operation here in California was originally opened to do disk technology, and the first Winchester disks were built on the West Coast.
They built their own processors — the 360 and 370 series and so on. On top of that they built their own operating systems, their own compilers, even their own databases and networking software, and in some cases even application programs. Certainly the core of the system, from the fundamental hardware up through the databases, OS, and compilers, was all built by IBM. The driver here was concentration of technical expertise: IBM could assemble world-class teams across this wide set of technologies and optimize across the whole stack, in a way that let their operating systems do things such as virtual memory long before other commercial efforts could.
And then the world changed — really changed — with the introduction of the personal computer and the takeoff of the microprocessor.
We went from a vertically organized industry to a horizontally organized one. There were silicon manufacturers — Intel doing processors, along with AMD and, initially, several other companies such as Fairchild and Motorola. A company like TSMC arose as a silicon foundry, making chips for others — something that didn't exist earlier but really took off in the late 80s and 90s, enabling other companies to build chips for graphics and other functions outside the processor.
But Intel didn"t do everything. Intel did the processors and Microsoft then came along and did OS and compilers on top of that. And oracle companies like Oracle came along and build their applications databases and other applications on top of that. So they became very horizontally organized industry. The key drivers behind this, obviously the introduction of the personal computer.
但是英特爾并沒有一家公司包攬所有業(yè)務(wù)。英特爾專做處理器,然后微軟出現(xiàn)了,微軟做操作系統(tǒng)和編譯器。甲骨文等公司隨之出現(xiàn),并在此基礎(chǔ)上構(gòu)建他們的應(yīng)用程序數(shù)據(jù)庫和其他應(yīng)用程序。這個行業(yè)就變成了一個縱向發(fā)展等行業(yè)。這背后的關(guān)鍵驅(qū)動因素,顯然是個人電腦的出現(xiàn)。
The rise of shrink-wrap software — something many of us did not see coming — became another crucial driver. It meant the number of architectures that could easily be supported had to stay fairly small, because shrink-wrap software companies did not want to port their software to, and verify it on, lots of different architectures.
And of course there was the dramatic growth of the general-purpose microprocessor. This was the period in which microprocessors replaced all other technologies, including the largest supercomputers — and I think it happened much faster than we expected. By the mid-80s, microprocessors had put a serious dent in the minicomputer business; by the early 90s the mainframe business was struggling; and from the mid-90s into the 2000s they took a real bite out of the supercomputer industry. Even supercomputers converted from customized special architectures to arrays of general-purpose microprocessors, which were simply far too efficient in cost and performance to ignore.
Now we"re all of a sudden in a new area where the new era not because general purpose processor is that gonna go completely go away. They going to remain to be important but they"ll be less centric to the drive to the edge to the ferry fastest most important applications with the domain specific processor will begin to play a key role. So rather than perhaps so much a horizontal we will see again a more vertical integration between the people who have the models for deep learning and machine learning systems the people who built the OS and compiler that enabled those to run efficiently train efficiently as well as be deployed in the field.
現(xiàn)在我們突然進(jìn)入了一個新時代。這并不意味著通用處理器會完全消失,它們?nèi)匀缓苤匾?,但它們將不是?qū)動行業(yè)發(fā)展的主力,能夠與軟件快速聯(lián)動的特定領(lǐng)域處理器將會逐漸發(fā)揮重大作用。因此,我們接下來或許會看到一個更垂直的行業(yè),會看到擁有深度學(xué)習(xí)和機(jī)器學(xué)習(xí)模型的開發(fā)者,與操作系統(tǒng)和編譯器的開發(fā)者之間更垂直的整合,使他們的程序能夠有效運(yùn)行、有效地訓(xùn)練以及進(jìn)入實際使用。
Inference is a critical part of this: when we deploy these systems in the field, we will probably have lots of very specialized processors that each handle one particular problem. The processor sitting in a security camera, for example, will have a very limited use; the key will be optimizing for power and efficiency in that use — and for cost, of course. So we see a different kind of integration, and Microsoft, Google, and Apple are all looking at this.
The Apple M1 is a perfect example: a processor designed by Apple with a deep understanding of the applications likely to run on it. It has a special-purpose graphics processor, a special-purpose machine learning accelerator, and multiple cores — and even the cores are not completely homogeneous: some are slow, low-power cores, and some are high-speed, high-performance, higher-power cores. So we see a completely different design approach, with much more codesign and vertical integration.
We"re optimizing in a different way than we had in the past and I think this is going to slowly but surely change the entire computer industry, not the general purpose processor will go away and not the companies that make software that runs on multiple machines will completely go away but will have a whole new driver and the driver is created by the dramatic breakthroughs that we seen in deep learning and machine learning. I think this is going to make for a really interesting next 20 years.
我們正在以與過去不同的方式進(jìn)行優(yōu)化,這會是一個緩慢的過程,但肯定會改變整個計算機(jī)行業(yè)。我不是說通用處理器會消失,也不是說做多平臺軟件的公司將消失。我想說的是,這個行業(yè)會有全新的驅(qū)動力,由我們在深度學(xué)習(xí)和機(jī)器學(xué)習(xí)中看到的巨大突破創(chuàng)造的驅(qū)動力。我認(rèn)為這將使未來 20 年變得非常有趣。
Thank you for your kind attention, and I'd like to wish the 2021 T-EDGE conference a great success. Thank you.