TePDist (TEnsor Program Distributed) is an automatic distributed training system infrastructure for DL models, not just an algorithm. The TePDist system operates in client/server mode. The client can be any frontend that generates XLA HLO. The server is responsible for distributed policy planning and for launching the automatic distributed tasks. The motivation for decoupling the client and server is to ease future integration with different frontend frameworks. TePDist has its own runtime graph and task scheduler for distributed execution. The TePDist system is currently based on an earlier version of community TensorFlow…
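TePDist's actual client/server protocol is not shown in this description, so the snippet below is only a rough, hypothetical sketch of the client side: it uses TensorFlow (the frontend family the system is based on) to lower a small function to XLA HLO text, which is the kind of artifact a TePDist client would hand to the server for distributed policy planning. The function name, tensor shapes, and the `print` standing in for an RPC to the server are illustrative assumptions, not TePDist APIs.

```python
import tensorflow as tf

# A small compute graph; jit_compile=True makes it lowerable through XLA.
@tf.function(jit_compile=True)
def dense_layer(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

# Example inputs (shapes chosen arbitrarily for illustration).
x = tf.random.normal([8, 16])
w = tf.random.normal([16, 32])
b = tf.zeros([32])

# Obtain the XLA HLO text for these concrete inputs.
hlo_text = dense_layer.experimental_get_compiler_ir(x, w, b)(stage="hlo")

# In a real TePDist setup, this HLO module would be sent to the server,
# which plans the distributed strategy and launches the distributed tasks.
# Here we simply print it as a placeholder for that hand-off.
print(hlo_text)
```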