[mpiwg-ft] FTWG Con Call Today

Van Der Wijngaart, Rob F rob.f.van.der.wijngaart at intel.com
Wed Sep 13 14:16:31 CDT 2017


Interesting paper for this forum: https://arxiv.org/abs/1709.03316
What does fault tolerant Deep Learning need from MPI?

Vinay Amatya, Abhinav Vishnu, Charles Siegel, Jeff Daily
(Submitted on 11 Sep 2017)
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithm for large scale data analysis. DL algorithms are computationally expensive - even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications become susceptible to faults - requiring development of a fault tolerant system infrastructure, in addition to fault tolerant DL algorithms. This raises an important question: What is needed from MPI for designing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for check-pointing of any critical data structures; and most importantly, consideration for several fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by extending MaTEx-Caffe for using ULFM-based implementation. Our evaluation using the ImageNet dataset and AlexNet, and GoogLeNet neural network topologies demonstrates the effectiveness of the proposed fault tolerant DL implementation using OpenMPI based ULFM.

-----Original Message-----
From: mpiwg-ft [mailto:mpiwg-ft-bounces at lists.mpi-forum.org] On Behalf Of Bland, Wesley
Sent: Wednesday, September 13, 2017 9:08 AM
To: FTWG <mpiwg-ft at lists.mpi-forum.org>
Subject: [mpiwg-ft] FTWG Con Call Today

The Fault Tolerance Working Group’s biweekly con call is today at 3:00 PM Eastern. Today's agenda:

* Go over F2F agenda and slides.

If there's something else that people would like to discuss, please just send an email to the WG so we can get it on the agenda.

Thanks, 
Wesley 

......................................................................................................................................... 
Join from PC, Mac, Linux, iOS or Android: https://tennessee.zoom.us/j/632356722?pwd=lI4%2F169CGcewIumekTziMw%3D%3D
   Password: mpiforum

Or iPhone one-tap (US Toll):  +14086380968,632356722# or +16465588656,632356722#

Or Telephone:
   Dial: +1 408 638 0968 (US Toll) or +1 646 558 8656 (US Toll)
   Meeting ID: 632 356 722
   International numbers available: https://tennessee.zoom.us/zoomconference?m=GscM59o_Qoig8v4aJl1OrsnXL-7Blrke

Or an H.323/SIP room system:
   H.323: 162.255.37.11 (US West) or 162.255.36.11 (US East) 
   Meeting ID: 632 356 722
   Password: 366244

   SIP: 632356722 at zoomcrc.com
   Password: 366244
......................................................................................................................................... 

_______________________________________________
mpiwg-ft mailing list
mpiwg-ft at lists.mpi-forum.org
https://lists.mpi-forum.org/mailman/listinfo/mpiwg-ft

