Handling punctuation in text with Python

2024-12-16 17:43:31

Recommended answers (1)

Answer 1:

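I collected all the English punctuation marks and the commonly used Chinese ones and use them for the check: the script copies to b.txt only those lines of a.txt that contain no punctuation. The input file a.txt currently has to be UTF-8 encoded. If your file uses a different encoding, changing the utf8 passed to decode() to your own encoding should be enough.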

# coding: utf8
from __future__ import unicode_literals
import re
non_stops = (
    '＂＃＄％＆＇（）＊＋，－'
    '／：；＜＝＞＠［＼］＾＿'
    '｀｛｜｝～｟｠'
    '｢｣､'
    '　、〃'
    '《》「」『』【】'
    '〔〕〖〗〘〙〚〛〜〝〞〟'
    '〰'
    '〾〿'
    '–—'
    '‘’‛“”„‟'
    '…‧'
    '﹏'
)
stops = (
    '！'
    '？'
    '｡'
    '。'
)

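punctuation = non_stops + stops
# Append the ASCII punctuation as four character-class ranges (not literal text).
punctuation += '!-/:-@[-`{-~'
r = re.compile('[{}]'.format(punctuation))

# Copy only the lines of a.txt that contain no punctuation into b.txt.
with open('a.txt', 'rb') as fin, open('b.txt', 'wb') as fout:
    for line in fin:
        # Change 'utf8' here if a.txt uses another encoding.
        if not r.search(line.decode('utf8')):
            fout.write(line)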
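
The four ASCII ranges appended to `punctuation` above can be checked against the standard library, and the line filter can be wrapped in a function that takes the encoding as a parameter. What follows is a minimal Python 3 sketch along those lines, not the answerer's code: `ASCII_RANGES`, `CJK_SAMPLE`, `has_punctuation` and `filter_lines` are hypothetical names introduced for illustration, and `CJK_SAMPLE` is only a short subset of the full list in the answer.

```python
import re
import string

# The four ASCII ranges from the answer, used inside a regex character class.
ASCII_RANGES = '!-/:-@[-`{-~'
ascii_re = re.compile('[{}]'.format(ASCII_RANGES))

# Collect every printable ASCII character (codes 33..126) that the class matches.
matched = ''.join(chr(c) for c in range(33, 127) if ascii_re.match(chr(c)))

# Hypothetical shortened sample of Chinese marks; the answer's list is longer.
CJK_SAMPLE = '，。！？、；：“”‘’（）《》【】…—'
punct_re = re.compile('[{}{}]'.format(CJK_SAMPLE, ASCII_RANGES))


def has_punctuation(text):
    """Return True if text contains ASCII or sampled Chinese punctuation."""
    return punct_re.search(text) is not None


def filter_lines(src, dst, encoding='utf8'):
    """Copy lines without punctuation from src to dst; return the count kept."""
    kept = 0
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        for raw in fin:
            # Decode only for matching; write the original bytes unchanged.
            if not has_punctuation(raw.decode(encoding)):
                fout.write(raw)
                kept += 1
    return kept
```

Here `matched == string.punctuation` evaluates to true, so the four ranges are a compact spelling of the 32 characters in `string.punctuation`. Under these assumptions, a GBK-encoded file could be processed with `filter_lines('a.txt', 'b.txt', encoding='gbk')`.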