Similar articles (Article ID, source, date, title, Prob / Score):

217934  VENTUREBEAT  2021-10-11  Microsoft and Nvidia team up to train one of the world’s largest language models  (1.000)
218014  ZDNET        2021-10-11  Microsoft and Nvidia create 105-layer, 530 billion parameter language model that needs 280 A100 GPUs, but it's still biased  (0.952 / 0.665)
217942  VENTUREBEAT  2021-10-11  Microsoft taps AI techniques to bring Translator to 100 languages  (0.703 / 0.561)
218175  ZDNET        2021-10-14  The state of AI in 2021: Machine learning in production, MLOps and data-centric AI  (0.432)
217885  ZDNET        2021-10-13  Microsoft Translator now works across 103 languages  (0.007 / 0.399)
218122  VENTUREBEAT  2021-10-14  Facebook introduces dataset and benchmarks to make AI more ‘egocentric’  (0.358)
218078  ARSTECHNICA  2021-10-14  How a mass extinction resulted in the rise of the snakes  (0.347)
217925  VENTUREBEAT  2021-10-10  AI lab DeepMind becomes profitable and bolsters relationship with Google  (0.327)
217957  VENTUREBEAT  2021-10-11  Facebook quietly acquires synthetic data startup AI.Reverie  (0.325)
217770  VENTUREBEAT  2021-10-12  Cloud optimization startup Cast AI raises $10M  (0.320)
218011  ZDNET        2021-10-11  Amazon AWS's AI team seeks the profound in the industrial  (0.320)
217948  VENTUREBEAT  2021-10-11  DeepMind proposes new benchmark to improve robots’ object-stacking abilities  (0.293)
217875  VENTUREBEAT  2021-10-12  DeepMind is developing one algorithm to rule them all  (0.288)
217802  ZDNET        2021-10-13  Opendoor discusses the secret sauce: 'A deeper mechanism to the world'  (0.275)
218002  ZDNET        2021-10-11  Researchers develop AI system to improve eye disease detection  (0.269)
217776  VENTUREBEAT  2021-10-12  AI edge chip startup Hailo lands $136M  (0.264)
217930  VENTUREBEAT  2021-10-11  Harness your data to make weak AI your strength  (0.259)
217928  TECHREPUBLIC 2021-10-11  Python ends C and Java's 20-year reign atop the TIOBE index  (0.258)
217954  VENTUREBEAT  2021-10-11  Streamlit, which helps data scientists build apps, hits version 1.0  (0.255)
217856  ARSTECHNICA  2021-10-10  These virtual obstacle courses help real robots learn to walk  (0.254)
217919  ZDNET        2021-10-12  Why graph DB + AI may be the future of data management  (0.252)
217859  VENTUREBEAT  2021-10-12  No-code AI analytics may soon automate data science jobs  (0.251)
218116  VENTUREBEAT  2021-10-14  Geospatial analytics startup AiDash lands $27M  (0.246)
218178  ZDNET        2021-10-14  Train at your own pace to become a Java programmer for only $10  (0.244)
217874  ARSTECHNICA  2021-10-12  IBM says AI can help track carbon pollution across vast supply chains  (0.244)


ID: 217934

URL: https://venturebeat.com/2021/10/11/microsoft-and-nvidia-team-up-to-train-one-of-the-worlds-largest-language-models/

Date: 2021-10-11

Microsoft and Nvidia team up to train one of the world’s largest language models

Microsoft and Nvidia today announced that they trained what they claim is the largest and most capable AI-powered language model to date: Megatron-Turing Natural Language Generation (MT-NLG). The successor to the companies' Turing NLG 17B and Megatron-LM models, MT-NLG contains 530 billion parameters and, Microsoft and Nvidia say, achieves unmatched accuracy in a broad set of natural language tasks, including reading comprehension, commonsense reasoning, and natural language inference.

"The quality and results that we have obtained today are a big step forward in the journey towards unlocking the full promise of AI in natural language. The innovations of DeepSpeed and Megatron-LM will benefit existing and future AI model development and make large AI models cheaper and faster to train," Paresh Kharya, Nvidia's senior director of product management and marketing for accelerated computing, and Ali Alvi, group program manager for the Microsoft Turing team, wrote in a blog post. "We look forward to how MT-NLG will shape tomorrow's products and motivate the community to push the boundaries of natural language processing (NLP) even further. The journey is long and far from complete, but we are excited by what is possible and what lies ahead."

In machine learning, parameters are the parts of the model that are learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well: language models with more parameters, more data, and more training time have been shown to acquire a richer, more nuanced understanding of language, for example gaining the ability to summarize books and even complete programming code.

To train MT-NLG, Microsoft and Nvidia say they created a training dataset of 270 billion tokens drawn from English-language websites.
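Tokenization splits raw text into such units, often by greedily matching the longest known subword. The sketch below illustrates the general idea only; the `greedy_tokenize` function and its tiny vocabulary are hypothetical, not the tokenizer Microsoft and Nvidia actually used:

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match subword segmentation (illustrative sketch only)."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a match.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

# Hypothetical subword vocabulary for demonstration.
vocab = {"token", "ization", "nat", "ural"}
print(greedy_tokenize("tokenization", vocab))  # ['token', 'ization']
```

Real subword schemes such as BPE or WordPiece learn their vocabularies from data, but the segmentation step follows this same longest-match spirit.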
Tokens, a way of separating text into smaller units for natural language processing, can be words, characters, or parts of words. Like all AI models, MT-NLG had to be trained by ingesting a set of examples to learn patterns among data points, such as grammatical and syntactic rules.

The dataset largely came from The Pile, an 835GB collection of 22 smaller datasets created by the open source AI research effort EleutherAI. The Pile spans academic sources (e.g., Arxiv, PubMed), communities (StackExchange, Wikipedia), code repositories (GitHub), and more, which Microsoft and Nvidia say they curated and combined with filtered snapshots of Common Crawl, a large collection of webpages including news stories and social media posts.

[Figure: The data used to train MT-NLG.]

Training took place across 560 Nvidia DGX A100 servers, each containing eight Nvidia A100 80GB GPUs. When benchmarked, Microsoft says, MT-NLG can infer basic mathematical operations even when the symbols are badly obfuscated. While not extremely accurate, the model seems to go beyond memorization for arithmetic and manages to complete tasks containing questions that prompt it for an answer, a major challenge in NLP.

It is well established that models like MT-NLG can amplify the biases in the data on which they were trained, and indeed, Microsoft and Nvidia acknowledge that the model "picks up stereotypes and biases from the [training] data." That's likely because a portion of the dataset was sourced from communities with pervasive gender, race, physical, and religious prejudices, which curation can't completely address. In a paper, the Middlebury Institute of International Studies' Center on Terrorism, Extremism, and Counterterrorism claims that GPT-3 and similar models can generate "informational and influential" text that might radicalize people into far-right extremist ideologies and behaviors.
A group at Georgetown University has used GPT-3 to generate misinformation, including stories around a false narrative, articles altered to push a bogus perspective, and tweets riffing on particular points of disinformation. Other studies, like one published in April by researchers from Intel, MIT, and the Canadian AI initiative CIFAR, have found high levels of stereotypical bias in some of the most popular open source models, including Google's BERT and XLNet and Facebook's RoBERTa.

Microsoft and Nvidia claim that they're "committed to working on addressing [the] problem" and encourage "continued research to help in quantifying the bias of the model." They also say that any use of Megatron-Turing in production "must ensure that proper measures are put in place to mitigate and minimize potential harm to users," and should follow tenets such as those outlined in Microsoft's Responsible AI Principles.

"We live in a time [when] AI advancements are far outpacing Moore's law. We continue to see more computation power being made available with newer generations of GPUs, interconnected at lightning speeds. At the same time, we continue to see hyper-scaling of AI models leading to better performance, with seemingly no end in sight," Kharya and Alvi continued. "Marrying these two trends together are software innovations that push the boundaries of optimization and efficiency."

Projects like MT-NLG, AI21 Labs' Jurassic-1, Huawei's PanGu-Alpha, Naver's HyperCLOVA, and the Beijing Academy of Artificial Intelligence's Wu Dao 2.0 are impressive from an academic standpoint, but building them doesn't come cheap. For example, the training dataset for OpenAI's GPT-3, one of the world's largest language models, was 45 terabytes in size, enough to fill 90 500GB hard drives.
The cost of AI training dropped 100-fold between 2017 and 2019, according to one source, but the totals still exceed the compute budgets of most startups. The inequity favors corporations with extraordinary access to resources at the expense of small-time entrepreneurs, cementing incumbent advantages.

For example, OpenAI's GPT-3 required an estimated 3.1423 × 10^23 floating-point operations (FLOPs) of compute during training. In computer science, FLOPS (floating-point operations per second) is a measure of raw processing performance, typically used to compare different types of hardware. Assuming OpenAI reserved 28 teraflops (28 trillion floating-point operations per second) of compute across a bank of Nvidia V100 GPUs, a common GPU available through cloud services, a single training run would cost $4.6 million. One Nvidia RTX 8000 GPU with 15 teraflops of compute would be substantially cheaper, but it would take 665 years to finish the training.

Microsoft and Nvidia say they observed between 113 and 126 teraflops per GPU while training MT-NLG. The cost is likely to have been in the millions of dollars. A Synced report estimated that a fake news detection model developed by researchers at the University of Washington cost $25,000 to train, and that Google spent around $6,912 training BERT, the language model it used to improve the quality of Google Search results.

Storage costs also mount quickly when dealing with datasets at the terabyte or petabyte scale. To take an extreme example, one of the datasets accumulated by Tesla's self-driving team, 1.5 petabytes of video footage, would cost over $67,500 to store in Azure for three months, according to CrowdStorage.

The effects of AI and machine learning model training on the environment have also been brought into relief.
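The cost and time figures quoted above follow from back-of-the-envelope arithmetic. A minimal sketch, in which the roughly $1.50-per-GPU-hour cloud rate for a V100 is an assumption on my part (the article quotes only the totals):

```python
TOTAL_FLOPS = 3.1423e23    # estimated total floating-point operations to train GPT-3
V100_FLOPS = 28e12         # assumed sustained throughput per V100 (28 teraflops)
RTX8000_FLOPS = 15e12      # RTX 8000 throughput cited in the article
CLOUD_RATE = 1.50          # assumed $/GPU-hour for a cloud V100 (not in the article)

SECONDS_PER_YEAR = 3600 * 24 * 365

gpu_seconds = TOTAL_FLOPS / V100_FLOPS
gpu_years = gpu_seconds / SECONDS_PER_YEAR   # about 356 GPU-years of V100 time
cost = gpu_seconds / 3600 * CLOUD_RATE       # about $4.7M, near the quoted $4.6M

# On a single RTX 8000, the same work takes about 664 years,
# matching the article's "665 years" up to rounding.
rtx_years = TOTAL_FLOPS / RTX8000_FLOPS / SECONDS_PER_YEAR

print(f"{gpu_years:.0f} V100 GPU-years, ~${cost / 1e6:.1f}M")
print(f"{rtx_years:.0f} years on one RTX 8000")
```

The small gap between the computed total and the article's $4.6 million reflects the assumed hourly rate; the year counts fall straight out of the throughput figures.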
In June 2020, researchers at the University of Massachusetts at Amherst released a report estimating that the power required to train and search a certain model entails emissions of roughly 626,000 pounds of carbon dioxide, equivalent to nearly five times the lifetime emissions of the average U.S. car. OpenAI itself has conceded that models like Codex require significant amounts of compute, on the order of hundreds of petaflops per day, which contributes to carbon emissions.

In a sliver of good news, the cost of FLOPS and basic machine learning operations has been falling over the past few years. A 2020 OpenAI survey found that since 2012, the amount of compute needed to train a model to the same performance on classifying images in a popular benchmark, ImageNet, has been decreasing by a factor of two every 16 months. Other recent research suggests that large language models aren't always more capable than smaller ones, depending on the techniques used to train them.

Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, says that when it comes to natural language, it's an open question whether larger models are the right approach. While some of the best benchmark performance scores today come from large datasets and models, the payoff from dumping enormous amounts of data into models is uncertain.

"The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets," Antoniak told VentureBeat in a previous interview. "These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate."
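The halving trend reported in the OpenAI survey compounds quickly. A quick sanity check, taking 2012 and 2020 as assumed endpoints for illustration:

```python
# Compute needed for fixed ImageNet performance halves every 16 months,
# per the OpenAI survey. Endpoint years are assumptions for illustration.
months = (2020 - 2012) * 12          # 96 months
reduction = 2 ** (months / 16)       # six halvings

print(f"~{reduction:.0f}x less compute needed in 2020 than in 2012")  # ~64x
```

Six halvings over eight years means a model that once cost a cluster-scale budget to train could, in principle, be matched with roughly 1/64th of the compute, which is part of why smaller, better-trained models keep closing the gap.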