N-gram Similarity: Calculation and Evaluation
Founder
2024-11-12 08:08:25

Example texts

  • Reference text: “Natural language processing is very interesting.”
  • Generated text: “Natural language processing is quite interesting.”

1. Extracting N-grams

1.1. Unigrams (1-grams)

Unigrams of the reference text:

["Natural", "language", "processing", "is", "very", "interesting"]

Unigrams of the generated text:

["Natural", "language", "processing", "is", "quite", "interesting"]

1.2. Bigrams (2-grams)

Bigrams of the reference text:

["Natural language", "language processing", "processing is", "is very", "very interesting"]

Bigrams of the generated text:

["Natural language", "language processing", "processing is", "is quite", "quite interesting"]
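The lists above omit the trailing period, whereas a plain whitespace split would leave it attached to the last word ("interesting."). The following is a minimal sketch of one way to reproduce the lists exactly, assuming a simple regex tokenizer that keeps only runs of word characters. The helper names `tokenize` and `ngrams` are illustrative and do not come from the original article.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: a token is a run of word characters; punctuation is dropped.
    return re.findall(r"\w+", text)


def ngrams(tokens: List[str], n: int) -> List[str]:
    # Slide a window of size n over the tokens and join each window with spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


reference = "Natural language processing is very interesting."
generated = "Natural language processing is quite interesting."

print(ngrams(tokenize(reference), 1))
print(ngrams(tokenize(generated), 2))
```

For this particular pair of sentences the choice of tokenizer does not change the scores, because both texts end with the same word and the same period.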

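2. Computing N-gram Overlap

2.1. Unigram overlap

Overlapping unigrams:

["Natural", "language", "processing", "is", "interesting"]

Number of overlapping unigrams: 5

Total unigrams in the reference text: 6

Total unigrams in the generated text: 6

2.2. Bigram overlap

Overlapping bigrams:

["Natural language", "language processing", "processing is"]

Number of overlapping bigrams: 3

Total bigrams in the reference text: 5

Total bigrams in the generated text: 5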
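When an n-gram occurs more than once, the overlap is usually counted with clipping: each n-gram is credited only as many times as it appears in both texts. A minimal sketch of that counting, using the multiset intersection of `collections.Counter`; the function name `clipped_overlap` and the repeated-word example are hypothetical and only serve as illustration.

```python
from collections import Counter
from typing import List


def clipped_overlap(reference: List[str], generated: List[str]) -> int:
    # Counter & Counter keeps each key with the minimum of the two counts,
    # so a repeated n-gram is only credited as often as it appears in both.
    common = Counter(reference) & Counter(generated)
    return sum(common.values())


ref_unigrams = ["Natural", "language", "processing", "is", "very", "interesting"]
gen_unigrams = ["Natural", "language", "processing", "is", "quite", "interesting"]
print(clipped_overlap(ref_unigrams, gen_unigrams))  # 5

# Hypothetical repeated-word case: "the" appears three times in the generated
# list but only once in the reference, so it is counted once.
print(clipped_overlap(["the", "cat"], ["the", "the", "the"]))  # 1
```

Without clipping, a generated text could inflate its score simply by repeating a word that occurs in the reference.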

3. Computing the Evaluation Metrics

3.1. Precision

  • Unigram precision
    $$\text{Precision}_{\text{Unigram}} = \frac{\text{number of overlapping unigrams}}{\text{total unigrams in the generated text}} = \frac{5}{6} \approx 0.83$$

  • Bigram precision
    $$\text{Precision}_{\text{Bigram}} = \frac{\text{number of overlapping bigrams}}{\text{total bigrams in the generated text}} = \frac{3}{5} = 0.6$$

3.2. Recall

  • Unigram recall
    $$\text{Recall}_{\text{Unigram}} = \frac{\text{number of overlapping unigrams}}{\text{total unigrams in the reference text}} = \frac{5}{6} \approx 0.83$$

  • Bigram recall
    $$\text{Recall}_{\text{Bigram}} = \frac{\text{number of overlapping bigrams}}{\text{total bigrams in the reference text}} = \frac{3}{5} = 0.6$$

3.3. F1 Score

  • Unigram F1 score
    $$\text{F1}_{\text{Unigram}} = 2 \times \frac{\text{Precision}_{\text{Unigram}} \times \text{Recall}_{\text{Unigram}}}{\text{Precision}_{\text{Unigram}} + \text{Recall}_{\text{Unigram}}} = 2 \times \frac{0.83 \times 0.83}{0.83 + 0.83} \approx 0.83$$

  • Bigram F1 score
    $$\text{F1}_{\text{Bigram}} = 2 \times \frac{\text{Precision}_{\text{Bigram}} \times \text{Recall}_{\text{Bigram}}}{\text{Precision}_{\text{Bigram}} + \text{Recall}_{\text{Bigram}}} = 2 \times \frac{0.6 \times 0.6}{0.6 + 0.6} = 0.6$$
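The arithmetic in this section can be checked directly from the counts in section 2. The sketch below assumes a small helper, `precision_recall_f1` (a name chosen here for illustration), that takes the overlap count and the two totals and returns 0 for any metric whose denominator would be zero.

```python
from typing import Tuple


def precision_recall_f1(
    overlap: int, generated_total: int, reference_total: int
) -> Tuple[float, float, float]:
    # Precision divides by the generated total, recall by the reference total.
    precision = overlap / generated_total if generated_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (name, overlap, generated total, reference total) taken from section 2.
for name, overlap, gen_total, ref_total in [("Unigram", 5, 6, 6), ("Bigram", 3, 5, 5)]:
    p, r, f1 = precision_recall_f1(overlap, gen_total, ref_total)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")
```

Because the two texts have the same length, the generated and reference totals coincide, which is why precision, recall, and F1 are all equal in this example.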

4. Python Implementation

from collections import Counter
from typing import List


def extract_ngrams(text: str, n: int) -> List[str]:
    # Split on whitespace, then join every window of n consecutive words.
    # Note: punctuation stays attached to its word (e.g. "interesting.").
    words = text.split()
    return [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]


def calculate_ngram_overlap(reference: str, generated: str, n: int):
    reference_ngrams = extract_ngrams(reference, n)
    generated_ngrams = extract_ngrams(generated, n)

    reference_counter = Counter(reference_ngrams)
    generated_counter = Counter(generated_ngrams)

    # N-grams that occur in both texts.
    overlapping_ngrams = set(reference_counter.keys()) & set(generated_counter.keys())

    # Clipped count: each n-gram is credited at most as often as it appears in both texts.
    overlap_count = sum(min(reference_counter[ngram], generated_counter[ngram]) for ngram in overlapping_ngrams)
    precision = overlap_count / len(generated_ngrams) if generated_ngrams else 0
    recall = overlap_count / len(reference_ngrams) if reference_ngrams else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) else 0

    return precision, recall, f1_score


reference_text = "Natural language processing is very interesting."
generated_text = "Natural language processing is quite interesting."

# Unigram (1-gram) scores
precision_unigram, recall_unigram, f1_unigram = calculate_ngram_overlap(reference_text, generated_text, 1)
print(f"Unigram - Precision: {precision_unigram:.2f}, Recall: {recall_unigram:.2f}, F1 score: {f1_unigram:.2f}")

# Bigram (2-gram) scores
precision_bigram, recall_bigram, f1_bigram = calculate_ngram_overlap(reference_text, generated_text, 2)
print(f"Bigram - Precision: {precision_bigram:.2f}, Recall: {recall_bigram:.2f}, F1 score: {f1_bigram:.2f}")

Code

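The complete example code has been uploaded to: Machine Learning and Deep Learning Algorithms with NumPy
The project contains NumPy implementations of more AI-related algorithms for study and reference. Stars are welcome~

Notes

My own expertise is limited, so feel free to get in touch with any questions~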
