  • Mel-spectrogram extraction and the griffin_lim vocoder [Python code analysis]

    In speech analysis, synthesis, and conversion, the first step is usually to extract acoustic feature parameters.
    Machine-learning approaches to these speech tasks commonly rely on the mel-spectrogram.
    This post shows how to extract a mel-spectrogram from an audio file, and how to turn a mel-spectrogram back into an audio waveform.

    Extracting a mel-spectrogram from an audio waveform:

    Pre-emphasize, frame, and window the audio signal
    Apply the short-time Fourier transform (STFT) to each frame to obtain the short-time magnitude spectrum
    Pass the magnitude spectrum through a mel filter bank to obtain the mel-spectrogram

    Reconstructing an audio waveform from the mel-spectrogram:

    Convert the mel-spectrogram back to a linear magnitude spectrum
    Reconstruct the waveform with the griffin_lim vocoder algorithm
    Apply de-emphasis
    There are many vocoders, such as WORLD and STRAIGHT, but griffin_lim is special: it can reconstruct a waveform from the spectrum without phase information, because it estimates the phase from the relationship between adjacent frames. The synthesized audio quality is fairly good, and the code is simple.
    Audio waveform → mel-spectrogram

    import copy

    import librosa
    import numpy as np
    from scipy import signal

    sr = 24000 # Sample rate.
    n_fft = 2048 # FFT points (samples)
    frame_shift = 0.0125 # seconds
    frame_length = 0.05 # seconds
    hop_length = int(sr*frame_shift) # samples
    win_length = int(sr*frame_length) # samples
    n_mels = 512 # Number of mel bands to generate
    power = 1.2 # Exponent for amplifying the predicted magnitude
    n_iter = 100 # Number of Griffin-Lim inversion iterations
    preemphasis = .97 # or None
    max_db = 100
    ref_db = 20
    top_db = 15
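    With these settings, hop_length = int(24000 * 0.0125) = 300 samples and win_length = int(24000 * 0.05) = 1200 samples, and each STFT frame produces 1 + n_fft/2 = 1025 frequency bins.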
    def get_spectrograms(fpath):
        '''Returns normalized log(mel-spectrogram) and log(magnitude) from `fpath`.
        Args:
          fpath: A string. The full path of a sound file.

        Returns:
          mel: A 2d array of shape (T, n_mels) <- Transposed
          mag: A 2d array of shape (T, 1+n_fft/2) <- Transposed
        '''
        # Loading sound file
        y, _ = librosa.load(fpath, sr=sr)

        # Trimming leading/trailing silence
        y, _ = librosa.effects.trim(y, top_db=top_db)

        # Preemphasis: y[n] = x[n] - preemphasis * x[n-1]
        y = np.append(y[0], y[1:] - preemphasis * y[:-1])

        # Short-time Fourier transform
        linear = librosa.stft(y=y,
                              n_fft=n_fft,
                              hop_length=hop_length,
                              win_length=win_length)

        # Magnitude spectrogram
        mag = np.abs(linear)  # (1+n_fft//2, T)

        # Mel spectrogram
        mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, 1+n_fft//2)
        mel = np.dot(mel_basis, mag)  # (n_mels, T)

        # To decibels
        mel = 20 * np.log10(np.maximum(1e-5, mel))
        mag = 20 * np.log10(np.maximum(1e-5, mag))

        # Normalize
        mel = np.clip((mel - ref_db + max_db) / max_db, 1e-8, 1)
        mag = np.clip((mag - ref_db + max_db) / max_db, 1e-8, 1)

        # Transpose
        mel = mel.T.astype(np.float32)  # (T, n_mels)
        mag = mag.T.astype(np.float32)  # (T, 1+n_fft//2)

        return mel, mag
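    A quick sanity check of the output shapes (a minimal sketch; "sample.wav" is a placeholder path, not a file from the original post):

    # Hypothetical usage: "sample.wav" is a placeholder input file.
    mel, mag = get_spectrograms("sample.wav")
    print(mel.shape)  # (T, n_mels), i.e. (T, 512) with the settings above
    print(mag.shape)  # (T, 1 + n_fft // 2), i.e. (T, 1025)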

    Mel-spectrogram → audio waveform

    def melspectrogram2wav(mel):
        '''Generate a waveform from a mel-spectrogram.'''
        # Transpose back to (n_mels, T)
        mel = mel.T

        # De-normalize
        mel = (np.clip(mel, 0, 1) * max_db) - max_db + ref_db

        # From decibels to amplitude
        mel = np.power(10.0, mel * 0.05)
        # Approximate inverse of the mel filter bank
        m = _mel_to_linear_matrix(sr, n_fft, n_mels)
        mag = np.dot(m, mel)

        # Waveform reconstruction
        wav = griffin_lim(mag)

        # De-preemphasis (inverse of the pre-emphasis filter)
        wav = signal.lfilter([1], [1, -preemphasis], wav)

        # Trim leading/trailing silence
        wav, _ = librosa.effects.trim(wav)

        return wav.astype(np.float32)

    def spectrogram2wav(mag):
        '''Generate a waveform from a linear magnitude spectrogram.'''
        # Transpose back to (1+n_fft//2, T)
        mag = mag.T

        # De-normalize
        mag = (np.clip(mag, 0, 1) * max_db) - max_db + ref_db

        # From decibels to amplitude
        mag = np.power(10.0, mag * 0.05)

        # Waveform reconstruction
        wav = griffin_lim(mag)

        # De-preemphasis (inverse of the pre-emphasis filter)
        wav = signal.lfilter([1], [1, -preemphasis], wav)

        # Trim leading/trailing silence
        wav, _ = librosa.effects.trim(wav)

        return wav.astype(np.float32)
    A few helper functions:

    def _mel_to_linear_matrix(sr, n_fft, n_mels):
        '''Approximate (diagonally scaled) pseudo-inverse of the mel filter bank.'''
        m = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        m_t = np.transpose(m)
        p = np.matmul(m, m_t)
        d = [1.0 / x if np.abs(x) > 1.0e-8 else x for x in np.sum(p, axis=0)]
        return np.matmul(m_t, np.diag(d))

    def griffin_lim(spectrogram):
        '''Applies the Griffin-Lim algorithm to a magnitude spectrogram.'''
        X_best = copy.deepcopy(spectrogram)
        for i in range(n_iter):
            # Back to the time domain with the current phase estimate
            X_t = invert_spectrogram(X_best)
            # Re-analyze and keep only the phase of the result
            est = librosa.stft(X_t, n_fft=n_fft, hop_length=hop_length, win_length=win_length)
            phase = est / np.maximum(1e-8, np.abs(est))
            # Combine the known magnitude with the estimated phase
            X_best = spectrogram * phase
        X_t = invert_spectrogram(X_best)
        y = np.real(X_t)

        return y


    def invert_spectrogram(spectrogram):
        '''Inverse STFT.
        spectrogram: [f, t]
        '''
        return librosa.istft(spectrogram, hop_length=hop_length, win_length=win_length, window="hann")
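    Putting the pieces together, here is a minimal round-trip sketch (the file names are placeholders, and soundfile is an extra dependency that the original code does not use):

    import soundfile as sf  # assumed extra dependency, only for writing the result

    mel, mag = get_spectrograms("sample.wav")   # placeholder input path
    wav_from_mel = melspectrogram2wav(mel)      # mel-spectrogram -> waveform via griffin_lim
    wav_from_mag = spectrogram2wav(mag)         # magnitude spectrogram -> waveform via griffin_lim
    sf.write("reconstructed_mel.wav", wav_from_mel, sr)
    sf.write("reconstructed_mag.wav", wav_from_mag, sr)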

    Pre-emphasis:
    The average power spectrum of a speech signal is shaped by the glottal excitation and by lip/nostril radiation; above roughly 800 Hz it rolls off at about 6 dB per octave. Pre-emphasis boosts the high-frequency components and flattens the spectrum, which makes spectral analysis and vocal-tract parameter estimation easier.
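    As a concrete illustration (a small sketch using only the names defined above; the test signal is made up), the pre-emphasis step in get_spectrograms is the FIR filter y[n] = x[n] - 0.97*x[n-1], and the de-emphasis step in the reconstruction functions is its exact inverse:

    # Pre-emphasis: y[n] = x[n] - preemphasis * x[n-1]  (boosts high frequencies)
    x = np.random.randn(1000).astype(np.float32)  # made-up test signal
    y = np.append(x[0], x[1:] - preemphasis * x[:-1])

    # De-emphasis: the inverse IIR filter 1 / (1 - preemphasis * z^-1)
    x_rec = signal.lfilter([1], [1, -preemphasis], y)
    print(np.allclose(x, x_rec, atol=1e-4))  # True: the two filters cancel out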

  • Original post: https://www.cnblogs.com/ly570/p/11198597.html