As BERT, Megatron, GPT-3 and other pre-trained models have achieved remarkable results in NLP, more and more teams have turned to ultra-large-scale training, pushing model sizes from hundreds of millions of parameters to hundreds of billions or even trillions. However, applying such very large models in real-world scenarios still faces challenges. First, the sheer number of parameters makes training and inference slow and deployment extremely costly. Second, many practical scenarios offer only limited data, so generalizing large models to few-shot settings remains difficult. In response to these problems, the Alibaba Cloud Machine Learning PAI team has launched EasyNLP, a Chinese NLP algorithm framework that helps large models land quickly and efficiently.

  • Easy to use and compatible with open source: EasyNLP supports commonly used Chinese NLP datasets and models, making it convenient for users to evaluate Chinese NLP technology. In addition to an easy-to-use and concise PAI command-line interface for calling cutting-edge NLP algorithms, EasyNLP provides abstractions such as AppZoo and ModelZoo (which includes knowledge-enhanced pre-trained models, among others) to lower the threshold for building NLP applications. EasyNLP can seamlessly load huggingface/transformers models, is also compatible with EasyTransfer models, and can improve training efficiency with its built-in distributed training framework (based on Torch-Accelerator).
  • Large-model and few-shot landing technology: The EasyNLP framework integrates a variety of classic few-shot learning algorithms, such as PET and P-Tuning, to enable few-shot fine-tuning on top of large models and thus bridge the gap between large models and small training sets. In addition, the PAI team combined classic few-shot learning with the idea of contrastive learning and proposed Contrastive Prompt Tuning (CP-Tuning), which adds no new parameters and requires no manually designed templates or label words; it took first place on the FewCLUE few-shot learning leaderboard, improving over standard fine-tuning by more than 10%. (A framework-agnostic sketch of the prompt idea follows this list.)
  • Large-model knowledge distillation technology: Because their parameter count makes large models hard to deploy, EasyNLP provides knowledge distillation to compress large models into efficient small models that meet the needs of online serving. EasyNLP also provides the MetaKD algorithm, which supports meta knowledge distillation and improves the student model's performance, in many domains even matching the teacher model. In addition, EasyNLP supports data augmentation: augmenting in-domain data with pre-trained models can effectively improve the effect of knowledge distillation.
  • Multimodal model technology: Since many NLP tasks need representations from other modalities, the EasyNLP framework supports not only pure NLP tasks but also various popular multimodal pre-trained models for NLP tasks that require visual knowledge or visual features. For example, EasyNLP integrates a CLIP model for text-image matching and a DALL-E-style Chinese model for text-to-image generation.
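
To make the prompt-based few-shot idea above concrete, here is a minimal, framework-agnostic sketch of the template-and-verbalizer mechanism behind PET-style methods. The template, labels, and helper function are illustrative only and are not part of EasyNLP's API.

# Illustrative only: a PET-style prompt turns classification into a cloze task.
# A template wraps the input around a [MASK] slot, and a "verbalizer" maps each
# class label to a label word that a masked language model can score.
template = "{sentence} Overall it was [MASK]."
verbalizer = {"positive": "great", "negative": "terrible"}

def to_cloze(sentence: str) -> str:
    """Render a raw example as a cloze question for a masked LM."""
    return template.format(sentence=sentence)

print(to_cloze("The soup arrived cold."))
# -> "The soup arrived cold. Overall it was [MASK]."
# The masked LM scores each label word at the [MASK] position; the class whose
# word receives the higher probability is the prediction. CP-Tuning builds on
# this setup but learns the prompt contrastively instead of hand-crafting it.
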
$ git clone https://github.com/alibaba/EasyNLP.git
$ cd EasyNLP
$ pip install -r requirements.txt
$ python setup.py install

Environment requirements: Python 3.6, PyTorch >= 1.8.

The following provides an example of BERT text classification. It only takes a few lines of code to train a BERT model:

First load the data through the load_dataset interface, then build a classification model, and finally call the Trainer to train it.

from easynlp.core import Trainer
from easynlp.appzoo import GeneralDataset, SequenceClassification, load_dataset
from easynlp.utils import initialize_easynlp

# Parse command-line arguments and initialize the EasyNLP runtime.
args = initialize_easynlp()

# Load the raw QNLI training split and wrap it in a GeneralDataset,
# which prepares it for the chosen backbone with the given maximum sequence length.
row_data = load_dataset('glue', 'qnli')["train"]
train_dataset = GeneralDataset(row_data, args.pretrained_model_name_or_path, args.sequence_length)

# Build a sequence classification model from the pre-trained backbone and train it.
model = SequenceClassification(pretrained_model_name_or_path=args.pretrained_model_name_or_path)
Trainer(model=model, train_dataset=train_dataset).train()

For more datasets, please see DataHub.
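
Before wrapping the data in a GeneralDataset, it can help to peek at a raw example. The snippet below assumes that load_dataset mirrors the Hugging Face datasets interface (as the 'glue'/'qnli' arguments above suggest), so splits can be indexed like lists of dicts; the printed fields are therefore only indicative.

from easynlp.appzoo import load_dataset

# Assumption: load_dataset follows Hugging Face datasets conventions,
# returning splits whose rows can be indexed one by one.
row_data = load_dataset('glue', 'qnli')["train"]
print(len(row_data))   # number of training examples
print(row_data[0])     # a single example, e.g. question / sentence / label fields
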

It is also possible to use a custom data interface:

from easynlp.core import Trainer
from easynlp.appzoo import ClassificationDataset, SequenceClassification
from easynlp.utils import initialize_easynlp

args = initialize_easynlp()

train_dataset = ClassificationDataset(
    pretrained_model_name_or_path=args.pretrained_model_name_or_path,
    data_file=args.tables,
    max_seq_length=args.sequence_length,
    input_schema=args.input_schema,
    first_sequence=args.first_sequence,
    label_name=args.label_name,
    label_enumerate_values=args.label_enumerate_values,
    is_training=True)

model = SequenceClassification(pretrained_model_name_or_path=args.pretrained_model_name_or_path)
Trainer(model=model,  train_dataset=train_dataset).train()

Test code:

python main.py \
  --mode train \
  --tables=train_toy.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --first_sequence=sent1 \
  --label_name=label \
  --label_enumerate_values=0,1 \
  --checkpoint_dir=./tmp/ \
  --epoch_num=1  \
  --app_name=text_classify \
  --user_defined_parameters='pretrain_model_name_or_path=bert-tiny-uncased'
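
The train_toy.tsv file referenced above must match the declared input_schema. As a hedged illustration (assuming .tsv means tab-separated columns in the same order as input_schema, i.e. label, sid1, sid2, sent1, sent2), a toy file could be generated like this:

# Write a tiny tab-separated training file whose columns follow the
# input_schema above: label, sid1, sid2, sent1, sent2 (all strings).
rows = [
    ("0", "s1_0", "s2_0", "the soup arrived cold", "service was slow"),
    ("1", "s1_1", "s2_1", "great pizza and friendly staff", "would order again"),
]
with open("train_toy.tsv", "w", encoding="utf-8") as f:
    for r in rows:
        f.write("\t".join(r) + "\n")
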

We also provide the AppZoo command line for training models, which can be started with a simple parameter configuration:

$ easynlp \
   --mode=train \
   --worker_gpu=1 \
   --tables=train.tsv,dev.tsv \
   --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
   --first_sequence=sent1 \
   --label_name=label \
   --label_enumerate_values=0,1 \
   --checkpoint_dir=./classification_model \
   --epoch_num=1  \
   --sequence_length=128 \
   --app_name=text_classify \
   --user_defined_parameters='pretrain_model_name_or_path=bert-small-uncased'
$ easynlp \
  --mode=predict \
  --tables=dev.tsv \
  --outputs=dev.pred.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --output_schema=predictions,probabilities,logits,output \
  --append_cols=label \
  --first_sequence=sent1 \
  --checkpoint_path=./classification_model \
  --app_name=text_classify
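
After prediction, the results land in dev.pred.tsv. Below is a small sketch for reading it back, under the assumption that the file is tab-separated and contains the output_schema columns (predictions, probabilities, logits, output) followed by the column appended via --append_cols (label); the exact column order is an assumption, not documented here.

import csv

# Assumption: dev.pred.tsv holds the output_schema columns first,
# then the appended "label" column, all tab-separated.
with open("dev.pred.tsv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        predictions, probabilities, logits, output, label = row[:5]
        print(f"pred={predictions}  gold={label}")
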

For more examples of AppZoo, see: AppZoo Documentation.

EasyNLP's ModelZoo currently supports the following pre-trained models.

  1. PAI-BERT-zh (from Alibaba PAI): pre-trained BERT models with a large Chinese corpus.
  2. DKPLM (from Alibaba PAI): released with the paper DKPLM: Decomposable Knowledge-enhanced Pre-trained Language Model for Natural Language Understanding by Taolin Zhang, Chengyu Wang, Nan Hu, Minghui Qiu, Chengguang Tang, Xiaofeng He and Jun Huang.
  3. KGBERT (from Alibaba Damo Academy & PAI): pre-trained BERT models with knowledge graph embeddings injected.
  4. BERT (from Google): released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  5. RoBERTa (from Facebook): released with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov.
  6. Chinese RoBERTa (from HFL): the Chinese version of RoBERTa.
  7. MacBERT (from HFL): released with the paper Revisiting Pre-trained Models for Chinese Natural Language Processing by Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang and Guoping Hu.
  8. WOBERT (from ZhuiyiTechnology): the word-based BERT for the Chinese language.
  9. Mengzi (from Langboat): released with the paper Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese by Zhuosheng Zhang, Hanqing Zhang, Keming Chen, Yuhang Guo, Jingyun Hua, Yulong Wang and Ming Zhou.

For a detailed list see readme.
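
ModelZoo models plug into the same interfaces shown earlier. The sketch below assumes that a ModelZoo name (here bert-base-chinese, used in the CLUE results further down) can be passed directly as pretrained_model_name_or_path, just as the AppZoo examples pass bert-small-uncased.

from easynlp.appzoo import SequenceClassification

# Assumption: ModelZoo names are accepted wherever a pre-trained backbone
# is specified, e.g. the Chinese BERT evaluated on CLUE below.
model = SequenceClassification(pretrained_model_name_or_path="bert-base-chinese")
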

EasyNLP provides few-shot learning and knowledge distillation to help users land large pre-trained models in practice.

  1. PET (from LMU Munich and Sulzer GmbH): released with the paper Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference by Timo Schick and Hinrich Schutze. We have made some slight modifications to make the algorithm suitable for the Chinese language.
  2. P-Tuning (from Tsinghua University, Beijing Academy of AI, MIT and Recurrent AI, Ltd.): released with the paper GPT Understands, Too by Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang and Jie Tang. We have made some slight modifications to make the algorithm suitable for the Chinese language.
  3. CP-Tuning (from Alibaba PAI): released with the paper Making Pre-trained Language Models End-to-end Few-shot Learners with Contrastive Prompt Tuning by Ziyun Xu, Chengyu Wang, Minghui Qiu, Fuli Luo, Runxin Xu, Songfang Huang and Jun Huang.
  4. Vanilla KD (from Alibaba PAI): distilling the logits of large BERT-style models to smaller ones (a generic sketch of logit distillation follows this list).
  5. Meta KD (from Alibaba PAI): released with the paper Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains by Haojie Pan, Chengyu Wang, Minghui Qiu, Yichang Zhang, Yaliang Li and Jun Huang.
  6. Data Augmentation (from Alibaba PAI): augmenting the data based on the MLM head of pre-trained language models.
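
As a generic illustration of the logit distillation mentioned in item 4 (not EasyNLP's implementation), the loss below softens teacher and student logits with a temperature and minimizes their KL divergence; in practice it would be mixed with the ordinary cross-entropy on gold labels.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-softened KL divergence between teacher and student logits."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

# Typical usage: loss = 0.5 * F.cross_entropy(student_logits, labels) \
#                     + 0.5 * distillation_loss(student_logits, teacher_logits)
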

EasyNLP provides CLUE evaluation code, making it convenient for users to quickly evaluate models on the CLUE datasets.

# Format: bash run_clue.sh device_id train/predict dataset
# e.g.: 
bash run_clue.sh 0 train csl

Using this script, the evaluation results of BERT, RoBERTa and other models can be obtained (on the dev sets):

(1) bert-base-chinese

Task | AFQMC  | CMNLI  | CSL    | IFLYTEK | OCNLI  | TNEWS  | WSC
P    | 72.17% | 75.74% | 80.93% | 60.22%  | 78.31% | 57.52% | 75.33%
F1   | 52.96% | 75.74% | 81.71% | 60.22%  | 78.30% | 57.52% | 80.82%

(2) chinese-roberta-wwm-ext

Task | AFQMC  | CMNLI  | CSL    | IFLYTEK | OCNLI  | TNEWS  | WSC
P    | 73.10% | 80.75% | 80.07% | 60.98%  | 80.75% | 57.93% | 86.84%
F1   | 56.04% | 80.75% | 81.50% | 60.98%  | 80.75% | 57.93% | 89.58%

For detailed examples, please refer to the CLUE evaluation example.

This project is licensed under the Apache License (Version 2.0). This toolkit also contains some code modified from other repos under other open-source licenses. See the NOTICE file for more information.

Scan the QR code below to join the DingTalk group. If you have any questions, please feel free to give feedback in the group.

For a more detailed explanation, please refer to our arXiv paper.

@article{easynlp,
  doi = {10.48550/ARXIV.2205.00258},  
  url = {https://arxiv.org/abs/2205.00258},  
  author = {Wang, Chengyu and Qiu, Minghui and Zhang, Taolin and Liu, Tingting and Li, Lei and Wang, Jianing and Wang, Ming and Huang, Jun and Lin, Wei},
  title = {EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing},
  publisher = {arXiv},  
  year = {2022}
}

