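A course on natural language processing in Python, taught by a young Chinese woman. It is explained extremely well, and I strongly recommend it.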
1. Overview
Three skills a data scientist should master:

Two formats of data:


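2. English vocabulary
profanity n. irreverence toward sacred things; blasphemy; profane or obscene language
corpus n. a collection of written or spoken texts; a language corpus
Dreyfus model: the Dreyfus model of skill acquisition
potty train v. to toilet-train (a young child); potty n. a child's toilet pot; adj. (informal) crazy; infatuated with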
3. Code
# Extract the text of the <p> tags
# Scrapes transcript data from scrapsfromtheloft.com
import requests
from bs4 import BeautifulSoup

def url_to_transcript(url):
    '''Returns transcript data specifically from scrapsfromtheloft.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    # Collect the text of every <p> inside the element with class "post-content"
    text = [p.text for p in soup.find(class_="post-content").find_all('p')]
    print(url)
    return text
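# Build a dictionary that maps each key to a one-element list holding its combined text
# (combine_text and data are not defined in these notes)
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}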
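The dictionary comprehension above relies on a `combine_text` helper and a `data` dictionary that these notes do not show. A minimal sketch of how they might look, assuming `data` maps a name to a list of paragraph strings and `combine_text` simply joins them with spaces (the sample keys and sentences are made up for illustration):

```python
def combine_text(list_of_text):
    """Assumed helper: join a list of paragraph strings into one string."""
    return ' '.join(list_of_text)

# Hypothetical sample data: name -> list of paragraphs
data = {
    "comedian_a": ["First paragraph.", "Second paragraph."],
    "comedian_b": ["Only paragraph."],
}

# Same comprehension as in the notes: each value becomes a one-element list
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
print(data_combined)
```

Wrapping each combined string in a list gives every key exactly one value, which is a convenient shape if the dictionary is later turned into a table with one row per key.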
# Clean the text
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)

# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = lambda x: clean_text_round2(x)
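To see what the two cleaning rounds do, here is a small self-contained sketch that repeats both functions and runs them in sequence on one string. The sample sentence is invented for illustration; the functions are the ones from the notes.

```python
import re
import string

def clean_text_round1(text):
    """Lowercase, drop [bracketed] text, ASCII punctuation, and words with digits."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

def clean_text_round2(text):
    """Drop curly quotes, ellipsis characters, and newlines."""
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

# Hypothetical sample line, not taken from any real transcript
sample = "[Laughter] It’s 2017… Hello, World!\n"

after_round1 = clean_text_round1(sample)
cleaned = clean_text_round2(after_round1)

# repr() makes leftover spaces visible
print(repr(after_round1))
print(repr(cleaned))
```

Note that neither round collapses whitespace, so removed tokens leave extra spaces behind, and newlines are deleted without inserting a space. If that matters for later steps, an additional whitespace-normalizing pass may be worth adding.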
Video: https://www.youtube.com/watch?v=xvqsFTUsOmc
Code: https://github.com/adashofdata/nlp-in-python-tutorial
4. Where to get data
