Quick Start

This page is a short tutorial for getting started with bert4keras.

Basic Example

Let's revisit the simple example from the home page. It covers the complete workflow of encoding a sentence with BERT base:

from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
import numpy as np

config_path = '/root/kg/bert/chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = '/root/kg/bert/chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = '/root/kg/bert/chinese_L-12_H-768_A-12/vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)  # build the tokenizer
model = build_transformer_model(config_path, checkpoint_path)  # build the model and load the weights

# Encoding test
token_ids, segment_ids = tokenizer.encode(u'语言模型')

print('\n ===== predicting =====\n')
print(model.predict([np.array([token_ids]), np.array([segment_ids])]))

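Short as it is, this example already covers the complete workflow of using a BERT model in Keras. In fact, users who are already familiar with Keras can build their own BERT-based models from this example alone: once model = build_transformer_model(config_path, checkpoint_path) has run successfully, a Keras-based BERT model is fully built, and everything that remains is ordinary Keras usage.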
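To make the two inputs in the example more concrete, here is a minimal sketch of what an encode call on a single sentence produces. It is a toy stand-in, not the bert4keras Tokenizer: toy_vocab and toy_encode are hypothetical names, the eight-entry vocabulary is made up, and plain per-character splitting replaces the real WordPiece logic. It only relies on the standard BERT convention of wrapping a sentence in [CLS] and [SEP] and giving a single sentence segment id 0 throughout.

```python
# Hypothetical toy vocabulary. In a real vocab.txt there is one token per
# line, and a token's id is its line index.
toy_vocab = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '语', '言', '模', '型']
token_to_id = {token: index for index, token in enumerate(toy_vocab)}


def toy_encode(text):
    """Encode one sentence the way a BERT-style tokenizer conventionally does.

    Assumption: splitting into single characters stands in for WordPiece.
    """
    tokens = ['[CLS]'] + list(text) + ['[SEP]']
    # Unknown characters fall back to the [UNK] id.
    token_ids = [token_to_id.get(t, token_to_id['[UNK]']) for t in tokens]
    # A single sentence belongs entirely to segment 0.
    segment_ids = [0] * len(token_ids)
    return token_ids, segment_ids


token_ids, segment_ids = toy_encode('语言模型')
print(token_ids)
print(segment_ids)
```

Both lists have the same length, which is why the example can wrap each of them in np.array([...]) to form a batch of size one before calling model.predict.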

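As the example shows, the code has two parts. The first builds the tokenizer: bert4keras.tokenizers contains a complete reimplementation of the original BERT tokenizer, plus some commonly used extra features. The second builds the BERT model; the main function is build_transformer_model, defined as follows: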

def build_transformer_model(
    config_path=None,  # model config file (in JSON format)
    checkpoint_path=None,  # pretrained weights (TensorFlow ckpt format)
    model='bert',  # model type (bert, albert, albert_unshared, nezha, electra, gpt2_ml, t5)
    application='encoder',  # model usage (encoder, lm, unilm)
    return_keras_model=True,  # return a Keras model, or the bert4keras model class
    **kwargs  # other arguments passed through
):
    ...  # function body omitted here

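The meaning of each build_transformer_model parameter is hard to convey in a few sentences, but these details are not essential in a 10-minute tutorial, so they are skipped for now. The best way to learn a framework is to read plenty of examples, so please consult the examples provided on GitHub.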
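For readers who still want a rough picture of the application argument, the sketch below shows the attention patterns that the names encoder, lm and unilm conventionally refer to: fully bidirectional, causal (left-to-right), and the UniLM seq2seq pattern in which the first segment is visible to every token while the second segment is causal. This is an illustration under that assumption, written in plain Python; the function names are hypothetical and this is not bert4keras's own code.

```python
from itertools import accumulate


def encoder_mask(length):
    """Bidirectional attention: every token may attend to every token."""
    return [[1] * length for _ in range(length)]


def lm_mask(length):
    """Causal attention: token i may attend only to tokens j <= i."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]


def unilm_mask(segment_ids):
    """UniLM-style seq2seq attention derived from segment ids.

    Tokens of segment 0 (the source) see the whole source; tokens of
    segment 1 (the target) see the source plus earlier target tokens.
    """
    # The running sum stays constant over the source and grows by one
    # per target token, so comparing sums yields the desired pattern.
    idxs = list(accumulate(segment_ids))
    n = len(idxs)
    return [[1 if idxs[j] <= idxs[i] else 0 for j in range(n)] for i in range(n)]


for row in unilm_mask([0, 0, 1, 1]):
    print(row)
```

In each mask, row i lists which positions token i is allowed to attend to. The unilm case needs segment_ids rather than just a length, which is one reason the tokenizer returns them alongside token_ids.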

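Supported Models

bert4keras can build and load weights for a fairly wide range of pretrained models; among similar libraries it is probably second only to Hugging Face's transformers. The pretrained models currently supported include: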

Google's original BERT: https://github.com/google-research/bert
brightmart's RoBERTa: https://github.com/brightmart/roberta_zh
HIT's RoBERTa: https://github.com/ymcui/Chinese-BERT-wwm
Google's original ALBERT [example]: https://github.com/google-research/ALBERT
brightmart's ALBERT: https://github.com/brightmart/albert_zh
Converted ALBERT: https://github.com/bojone/albert_zh
Huawei's NEZHA: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/NEZHA
In-house language models: https://github.com/ZhuiyiTechnology/pretrained-models
T5: https://github.com/google-research/text-to-text-transfer-transformer
GPT2_ML: https://github.com/imcaspar/gpt2-ml
Google's original ELECTRA: https://github.com/google-research/electra
HIT's ELECTRA: https://github.com/ymcui/Chinese-ELECTRA
CLUE's ELECTRA: https://github.com/CLUEbenchmark/ELECTRA

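Notes

Note 1: brightmart's ALBERT was open-sourced before Google's ALBERT, so the early brightmart ALBERT weights are not fully consistent with Google's; in other words, the two cannot be swapped for each other directly. To reduce code redundancy, bert4keras 0.2.4 and later only support loading the Google weights and those brightmart weights whose names contain "Google". To load the earlier weights, use version 0.2.3, or consider the albert_zh weights converted by the author.
Note 2: If the downloaded ELECTRA weights come without a JSON config file, write one yourself, using the reference linked here as a template.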
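As a rough illustration of Note 2, a config file can be written from a plain dictionary with the standard json module. The key names below follow the layout of the bert_config.json files shipped with Google's BERT release; the embedding_size key and every numeric value are assumptions chosen as placeholders for a base-sized model, and write_config is a hypothetical helper. All of them must be replaced with the settings that actually match the downloaded checkpoint.

```python
import json
import os
import tempfile

# Placeholder values only: every number here is an assumption and must be
# replaced with the settings of the checkpoint being loaded.
electra_config = {
    "attention_probs_dropout_prob": 0.1,
    "embedding_size": 768,  # assumed key name for the embedding width
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 21128,  # must equal the number of lines in vocab.txt
}


def write_config(config, path):
    """Write a config dictionary to `path` as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, sort_keys=True)


# Written to a temporary directory here; in practice the file would sit
# next to the checkpoint and be passed as config_path.
config_path = os.path.join(tempfile.mkdtemp(), "electra_config.json")
write_config(electra_config, config_path)
```

The resulting path would then be passed as config_path to build_transformer_model, together with model='electra'.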