Parsing Kindle Notes with beautifulsoup4

Tags: Variety, anki, kindle, 小书匠, note management
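Contents

1  Requirements
2  System environment
3  Exporting reading notes from the Kindle app
4  Parsing the HTML notes
4.1  Parsing the book's basic information
4.2  Parsing the book's notes
5  Using the HTML notes
5.1  Appending to the Kindle clippings file My Clippings.txt
5.2  Adapting to Variety quotes
5.3  Converting to Anki notes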


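1. Requirements

I own a Kindle Paperwhite 3 (KPW3) and usually read e-books on several devices: the KPW3, an Android phone, an iPad, and a computer. While reading I highlight passages and take notes. Oddly, highlights and notes made on the KPW3 sync to the other devices, and those made elsewhere sync back to the KPW3 for display, but they are never written to My Clippings.txt, so the reading notes cannot be processed any further. The workaround is to use the note export feature of the Kindle app on Android to export all notes for one book as HTML, then parse that file, merge the result into My Clippings.txt, and feed it to applications such as Variety and Anki.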

2. System environment

# System environment
!lsb_release -a

No LSB modules are available.
Distributor ID: LinuxMint
Description: Linux Mint 19.3 Tricia
Release: 19.3
Codename: tricia

# Python and related library versions
!python --version
!python -m pip list --format=columns | grep beautifulsoup4
!python -m pip list --format=columns | grep lxml

Python 3.6.9
beautifulsoup4 4.9.0
lxml 4.5.0

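3. Exporting reading notes from the Kindle app

First make sure the notes in the Kindle app on Android are complete. Sync can leave devices inconsistent: phone 1 may show a complete highlight while phone 2 shows only a single character, as in the figure below. In that case the only fix is to go to that location and highlight the passage again.

Figure: Notes inconsistent across synced devices

Once the notes are complete, export them with the unformatted (无格式) option. The export steps are shown below:

Figure: Step 1, Step 2

After export, the notes look like this:

Figure: The exported Kindle notes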


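4. Parsing the HTML notes


4.1 Parsing the book's basic information

# Import libraries
import re
from bs4 import BeautifulSoup
# Python 3 text type; unicode() is used below to turn parsed nodes back into markup
unicode=str

# Create the BeautifulSoup object (the export is assumed to be UTF-8)
with open('./demo.html',encoding='utf-8') as f:
    soup=BeautifulSoup(f,features='html.parser')

# Get the book title and authors
bookname=soup.find_all('div',class_='bookTitle')[0].text.strip()
authors=soup.find_all('div',class_='authors')[0].text.strip()
print(bookname,authors)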

拆掉思维里的墙:原来我还可以这样活 古典
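To try the class-based lookup without a real export, the following minimal sketch runs the same kind of query against an inline HTML string. The sample markup is a made-up stand-in: only the class names `bookTitle` and `authors` come from the code above, and a real export contains more structure. The helper `first_text` is hypothetical and not part of beautifulsoup4.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of an exported notes file; only the class names
# are taken from the parsing code above.
SAMPLE_HTML = """
<html><body>
<div class="bookTitle"> Sample Book </div>
<div class="authors"> Sample Author </div>
<div class="noteText">A highlighted sentence.</div>
</body></html>
"""

def first_text(soup, css_class):
    """Return the stripped text of the first <div> with the given class, or ''."""
    found = soup.find('div', class_=css_class)
    return found.get_text().strip() if found is not None else ''

sample_soup = BeautifulSoup(SAMPLE_HTML, features='html.parser')
sample_title = first_text(sample_soup, 'bookTitle')
sample_authors = first_text(sample_soup, 'authors')
sample_missing = first_text(sample_soup, 'noSuchClass')
print(sample_title, sample_authors)
```

Returning an empty string instead of indexing `[0]` avoids an `IndexError` when an export lacks one of the expected classes.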


4.2 Parsing the book's notes

# Container holding all notes (positional indexes match this export's layout under html.parser)
allcontents=soup.contents[3].contents[3].contents[1]

def newnote(section=''):
    # Empty note record; the section heading carries over to later notes
    return {'sectionHeading':section,'noteHeading':{'markColor':'','markPosition':''},'noteText':'','takenote':{'takePosition':'','note':''}}

def cleantext(node):
    # Strip surrounding whitespace and remove embedded line breaks
    return node.text.strip().replace('\n','')

# Walk through all note nodes; in this export the notes start at index 11
allnotes=[]
takenoteflag=False
note=newnote()
for conind in range(11,len(allcontents)):
    content=BeautifulSoup(unicode(allcontents.contents[conind]),features='lxml')
    # Whitespace-only nodes parse to an empty document; skip them
    if len(content)==0:
        continue
    # Tell the node types apart by their CSS class
    div=content.select('div')
    if not div or not div[0].get('class'):
        continue
    divclass=div[0].get('class')[0]
    # Section the note belongs to
    if divclass=='sectionHeading':
        note['sectionHeading']=cleantext(content)
    # Heading of a highlight or of a personal note
    elif divclass=='noteHeading':
        markpos=re.findall(r'\d+',cleantext(content))[0]
        if takenoteflag:
            note['takenote']['takePosition']=markpos
        else:
            note['noteHeading']['markColor']=cleantext(content.span)
            note['noteHeading']['markPosition']=markpos
    # Text of a personal note attached to the previous highlight
    elif divclass=='noteText' and takenoteflag:
        note['takenote']['note']=cleantext(content)
        takenoteflag=False
        allnotes.append(note)
        note=newnote(note['sectionHeading'])
    # Text of a plain highlight
    elif divclass=='noteText':
        note['noteText']=cleantext(content)
        # Look ahead to the next non-empty node to see whether a personal note follows
        nextnote=None
        for nextind in range(conind+1,len(allcontents)):
            candidate=BeautifulSoup(unicode(allcontents.contents[nextind]),features='lxml')
            if len(candidate)>0:
                nextnote=candidate
                break
        if nextnote is not None and '笔记' in cleantext(nextnote):
            takenoteflag=True
        else:
            allnotes.append(note)
            note=newnote(note['sectionHeading'])
# print(allnotes)
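The pairing logic in the loop above (a highlight, optionally followed by a personal note) can also be sketched independently of the HTML. The example below is an illustration under assumptions, not the code used in this post: it takes already-extracted `(css_class, text)` pairs, and the sample headings such as `标注(黄色) - 位置 120` are made-up strings shaped like the export. The function name `pair_notes` and the record keys are hypothetical.

```python
import re

def pair_notes(items):
    """Group (css_class, text) pairs into highlight records with optional notes."""
    notes = []
    section = ''
    current = None          # most recent highlight, which a personal note may attach to
    pending_kind = None     # kind of the most recent noteHeading
    pending_pos = ''
    for css_class, text in items:
        text = text.strip()
        if css_class == 'sectionHeading':
            section = text
        elif css_class == 'noteHeading':
            match = re.search(r'\d+', text)
            pending_pos = match.group() if match else ''
            # Headings of personal notes contain the word 笔记 ("note")
            pending_kind = 'note' if '笔记' in text else 'highlight'
        elif css_class == 'noteText':
            if pending_kind == 'note' and current is not None:
                current['note'] = text
                current['notePosition'] = pending_pos
            else:
                current = {'section': section, 'position': pending_pos,
                           'text': text, 'note': '', 'notePosition': ''}
                notes.append(current)
    return notes

sample_items = [
    ('sectionHeading', '第一章'),
    ('noteHeading', '标注(黄色) - 位置 120'),
    ('noteText', 'First highlight'),
    ('noteHeading', '笔记 - 位置 120'),
    ('noteText', 'My comment'),
    ('noteHeading', '标注(黄色) - 位置 245'),
    ('noteText', 'Second highlight'),
]
paired = pair_notes(sample_items)
for record in paired:
    print(record)
```

Keeping the state in a few local variables removes the need to look ahead at the next node, at the cost of assuming that a personal note always follows the highlight it belongs to.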

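5. Using the HTML notes


5.1 Appending to the Kindle clippings file My Clippings.txt

With the notes parsed, write them out in the highlight and note formats used by My Clippings.txt and append them to that file. Once merged, the notes can be managed with existing tools such as clippings.io and 书见.

Note: the exported notes carry no time information, so the current system time is used as the note time. It is not the time the note was actually taken.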

# Get the current time in the format used by My Clippings.txt
import time
def Getnowdate():
    week_day_dict = {
        0 : '星期一',
        1 : '星期二',
        2 : '星期三',
        3 : '星期四',
        4 : '星期五',
        5 : '星期六',
        6 : '星期天',
    }
    loctime=time.localtime()
    years='%d年%d月%d日'%(loctime.tm_year,loctime.tm_mon,loctime.tm_mday)
    weeks=week_day_dict[loctime.tm_wday]

    # 12-hour clock: 上午 (AM) before noon, 下午 (PM) from noon on
    period='上午' if loctime.tm_hour<12 else '下午'
    hour=loctime.tm_hour%12 or 12
    times=period+str(hour)+time.strftime(":%M:%S", loctime)
    nowdate=years+weeks+' '+times
    return nowdate

# Read the existing clippings
with open('My Clippings.txt','r',encoding='utf-8') as f:
    existnotes=f.readlines()

# Append the entries that are not in the file yet
with open('My Clippings.txt','a',encoding='utf-8') as fw:
    for note in allnotes:
        markpos=note['noteHeading']['markPosition']
        marktext=note['noteText'].replace('\n','')
        taketext=note['takenote']['note'].replace('\n','')

        # Highlight entry
        if (marktext+'\n') not in existnotes:
            fw.write(bookname+' ('+authors+')\n')
            # The blank line after the metadata line separates it from the text
            fw.write('- 您在位置 #'+markpos+'-'+str(int(markpos)+1)+' 的标注'+' | 添加于 '+Getnowdate()+'\n\n')
            fw.write(marktext+'\n')
            fw.write('==========\n')

        # Personal note entry, written only when a note exists
        if taketext!='' and (taketext+'\n') not in existnotes:
            fw.write(bookname+' ('+authors+')\n')
            fw.write('- 您在位置 #'+markpos+' 的笔记'+' | 添加于 '+Getnowdate()+'\n\n')
            fw.write(taketext+'\n')
            fw.write('==========\n')

Figure: Reading notes appended to My Clippings.txt
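Because the timestamp depends on the system clock, the entry format is easier to check when the time is passed in explicitly. The sketch below builds one entry as a string from a fixed `datetime`. The function names `format_timestamp` and `format_clipping` are hypothetical, the Chinese wording is copied from the code in this section, and the blank line between the metadata line and the text is an assumption about the file layout that should be compared with entries written by the device itself.

```python
from datetime import datetime

WEEKDAYS = ['星期一', '星期二', '星期三', '星期四', '星期五', '星期六', '星期天']

def format_timestamp(when):
    """Format a datetime as date, weekday, and 12-hour time with 上午/下午."""
    period = '上午' if when.hour < 12 else '下午'
    hour12 = when.hour % 12 or 12
    return '%d年%d月%d日%s %s%d:%02d:%02d' % (
        when.year, when.month, when.day, WEEKDAYS[when.weekday()],
        period, hour12, when.minute, when.second)

def format_clipping(book, authors, position, text, when, is_note=False):
    """Build one entry: title line, metadata line, blank line, text, separator."""
    if is_note:
        location = '- 您在位置 #%s 的笔记' % position
    else:
        location = '- 您在位置 #%s-%d 的标注' % (position, int(position) + 1)
    return '\n'.join([
        '%s (%s)' % (book, authors),
        '%s | 添加于 %s' % (location, format_timestamp(when)),
        '',
        text,
        '==========',
    ]) + '\n'

fixed_time = datetime(2020, 4, 5, 15, 7, 9)   # arbitrary example time
entry = format_clipping('Sample Book', 'Sample Author', '120',
                        'First highlight', fixed_time)
note_entry = format_clipping('Sample Book', 'Sample Author', '120',
                             'My comment', fixed_time, is_note=True)
print(entry)
```

Building the whole entry as one string also makes it possible to test for duplicates against the full entry text rather than a single line.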

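5.2 Adapting to Variety quotes

Variety is a wallpaper manager for Linux that can display quotes read from a local file. Here the Kindle notes are converted to the format Variety recognizes so they show up on the desktop for everyday viewing.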

# Read the quotes already added, keeping only the quote text before '['
def Delline(line):
    if '[' in line:
        return line[:line.index('[')]
    return line.strip()

with open('/home/wu/.config/variety/pluginconfig/quotes/qotes.txt','r',encoding='utf-8') as f:
    existnotes=list(map(Delline,f.readlines()))

# Write the new quotes file
with open('qotes.txt','w',encoding='utf-8') as fw:
    for note in allnotes:
        quote=note['noteText'].replace('\n','')
        if quote not in existnotes:
            fw.write(quote+'['+note['sectionHeading'].replace('\n','')+']'+'——'+bookname+' ('+authors+')\n')
            mynote=note['takenote']['note'].replace('\n','')
            if mynote!='':
                fw.write('#'+mynote+'——@'+'WuShaogui\n')
            # A line holding a single '.' ends the entry
            fw.write('.\n')

Figure: The parsed document; Variety configuration

Figure: Variety quote display
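The quote layout used in this section (quote text, section in square brackets, source, an optional comment line starting with `#`, and a closing line holding only `.`) can be factored into two small functions. This is a sketch under assumptions: `format_quote` and `existing_quotes` are hypothetical names, the layout is taken from the code above rather than from Variety's documentation, and the commenter name defaults to a placeholder.

```python
def format_quote(text, section, book, authors, comment='', commenter='me'):
    """Build one quotes-file entry in the layout used in this section."""
    lines = ['%s[%s]——%s (%s)' % (text, section, book, authors)]
    if comment:
        lines.append('#%s——@%s' % (comment, commenter))
    lines.append('.')            # a lone '.' closes the entry
    return '\n'.join(lines) + '\n'

def existing_quotes(file_text):
    """Return the quote texts (the part before '[') already present in a quotes file."""
    quotes = set()
    for line in file_text.splitlines():
        if '[' in line and not line.startswith('#'):
            quotes.add(line[:line.index('[')])
    return quotes

# Pretend the quotes file already holds the first highlight
quotes_file = format_quote('First highlight', 'Chapter 1', 'Sample Book', 'Sample Author')
known = existing_quotes(quotes_file)
new_entries = [
    format_quote(text, 'Chapter 1', 'Sample Book', 'Sample Author', comment)
    for text, comment in [('First highlight', ''), ('Second highlight', 'My comment')]
    if text not in known
]
print(''.join(new_entries))
```

Collecting the known quotes in a set keeps the duplicate check fast even when the quotes file grows large.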


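5.3 Converting to Anki notes

Anki is a powerful memorization tool. Importing the Kindle notes into Anki lets you review a book's notes repeatedly and deepen your understanding.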

# Write the notes in a format Anki can import (tab-separated fields, one note per line)
with open('%s-%s.txt'%(bookname,authors),'w',encoding='utf-8') as fw:
    for note in allnotes:
        fw.write(note['noteText'].replace('\n','')+'\t'
                 +note['sectionHeading'].replace('\n','')+'\t'
                 +bookname+'\t'+authors+'\t'+note['takenote']['note'].replace('\n','')+'\n')

Figure: The parsed document; importing the parsed document into Anki

Figure: Result after import; final result
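Joining fields by hand breaks if a highlight itself contains a tab. As a hedged alternative, the sketch below writes the same five fields with Python's standard `csv` module, which quotes such fields automatically. The function name `write_anki_rows` and the sample records are hypothetical, and whether the importer honors CSV-style quoting should be checked against your Anki version.

```python
import csv
import io

def write_anki_rows(notes, book, authors, handle):
    """Write one tab-separated row per note: text, section, book, authors, comment."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    for note in notes:
        writer.writerow([
            note['noteText'],
            note['sectionHeading'],
            book,
            authors,
            note['takenote']['note'],
        ])

# Made-up records with the same keys as the parsed notes
sample_notes = [
    {'sectionHeading': 'Chapter 1', 'noteText': 'First highlight',
     'takenote': {'note': ''}},
    {'sectionHeading': 'Chapter 1', 'noteText': 'Second\thighlight',
     'takenote': {'note': 'My comment'}},
]
buffer = io.StringIO()
write_anki_rows(sample_notes, 'Sample Book', 'Sample Author', buffer)
anki_text = buffer.getvalue()
print(anki_text)
```

To write a real file, pass a handle opened with `open(path, 'w', encoding='utf-8', newline='')` instead of the in-memory buffer.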