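ownthink has open-sourced its knowledge graph data, which makes a good base dataset for building a knowledge graph. I chose neo4j as the storage backend for the graph.
Results
- The data import described later takes a very long time, so here is the result first.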
MATCH (n) RETURN n LIMIT 2000
* Zhuge Liang (诸葛亮) has gender male (男)
* What else can you spot?
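Environment setup
- Install neo4j
- Install python3 (you could process the data with shell instead, but I would not bother)
- Install go (needed for the data-processing tool; a ready-made one exists, or you can write your own in Python)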
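Data preparation
Download the data from Baidu Netdisk or from the official site: https://www.ownthink.com/docs/kg/#_1
Unzip it to get the CSV file
wc -l ownthink_v2.csv  # shows the line count, i.e. the data volume
- The zip file should be about 1.95 GB
- The unzipped CSV file is a little over 8 GB
The first 20 lines after unzipping and processing are: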
实体,属性,值
胶饴,描述,别名: 饴糖、畅糖、畅、软糖。
词条,描述,词条(拼音:cí tiáo)也叫词目,是辞书学用语,指收列的词语及其释文。
词条,标签,文化
红色食品,描述,红色食品是指食品为红色、橙红色或棕红色的食品。
红色食品,中文名,红色食品
红色食品,是否含防腐剂,否
红色食品,主要食用功效,预防感冒,缓解疲劳
红色食品,适宜人群,全部人群
红色食品,用途,增强表皮细胞再生和防止皮肤衰老
红色食品,标签,非科学
红色食品,标签,生活
大龙湫,描述,雁荡山景区分散,东起羊角洞,西至锯板岭;南起筋竹溪,北至六坪山。
大龙湫,中文名称,大龙湫
大龙湫,外文名称,big dragon autrum
大龙湫,地理位置,浙江省温州市雁荡山景区
大龙湫,开放时间,08:00~18:00
大龙湫,门票价格,50元
大龙湫,著名景点,芙蓉峰
大龙湫,著名景点,剪刀峰
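The same inspection can be done from Python without loading the 8 GB file into memory. This is only a sketch: the helper names `head` and `count_lines` are made up for illustration and are not part of any tool mentioned in this post.

```python
from itertools import islice


def head(path, n=20, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def count_lines(path, chunk_size=1 << 20):
    """Count newline characters by reading the file in 1 MiB binary chunks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

Calling `head("ownthink_v2.csv")` should give the same 20 lines as listed above, and `count_lines("ownthink_v2.csv")` plays the role of `wc -l`.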
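Data processing & cleaning
Notes
As the sample above shows, the knowledge graph data consists of triples: entity, attribute, value (the value can also be treated as an entity). In graph terms, each triple should be cleaned into two vertices connected by one edge.
Going a step further, one should also work out which values belong as properties and which as relationships.
For example, Yao Ming's (姚明) gender "male" (男) should not become an entity, but his wife "Ye Li" (叶莉) should.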
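One way to picture the property-versus-relationship split is a lookup on the attribute name. The sketch below is not what this post's pipeline does (the pipeline turns every value into a vertex); the set `ATTRIBUTE_PREDICATES` and both function names are hypothetical, and choosing which attributes go into that set is a modelling decision the dataset does not make for you.

```python
# Hypothetical list of attributes whose values should stay plain properties.
ATTRIBUTE_PREDICATES = {"性别", "身高", "描述"}


def is_property(predicate):
    """True if values of this attribute should be stored on the node itself."""
    return predicate in ATTRIBUTE_PREDICATES


def build_graph(triples):
    """Split (entity, attribute, value) triples into properties and edges."""
    properties = {}  # entity -> {attribute: [values]}
    edges = []       # (entity, attribute, other entity)
    for entity, attribute, value in triples:
        if is_property(attribute):
            properties.setdefault(entity, {}).setdefault(attribute, []).append(value)
        else:
            edges.append((entity, attribute, value))
    return properties, edges
```

With the triples `("姚明", "性别", "男")` and `("姚明", "妻子", "叶莉")`, the first ends up as a property of 姚明 and the second as an edge to a separate 叶莉 entity.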
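Data structure conversion
- Convert the triples into entities and relationships
- Download https://github.com/jievince/rdf-converter
- Rebuild it from the repository root
go build
- Process the data
./rdf-converter --path /path/to/ownthink_v2.csv
- The output should contain roughly 40 million entity (vertex) records and roughly 140 million edge records.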
vertex.csv
-2469395383949115281,过度包装
-5567206714840433083,Over Package
3836323934884101628,有的商品故意增加包装层数
1185893106173039861,很多采用实木、金属制品
3455734391170888430,非科学
9183164258636124946,教育
5258679239570815125,成熟市场
-8062106589304861485,"成熟市场是指低增长率,高占有率的市场。"
edge.csv
3413383836870836248,-948987595135324087,含义
3413383836870836248,8037179844375033188,定义
3413383836870836248,-2559124418148243756,标签
3413383836870836248,8108596883039039864,标签
2587975790775251569,-4666568475926279810,描述
2587975790775251569,2587975790775251569,中文名称
2587975790775251569,3771551033890875715,外文名称
2587975790775251569,2900555761857775043,地理位置
2587975790775251569,-1913521037799946160,占地面积
2587975790775251569,-1374607753051283066,开放时间
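To make the shape of these two files concrete, here is a rough Python sketch of the same idea: every entity name and every value becomes a vertex keyed by a signed 64-bit hash, and every triple becomes an edge between two such keys. The hash used here (`blake2b` truncated to 8 bytes) is an assumption for illustration only; it is not the hash rdf-converter uses, so the IDs will not match the ones above. The function names are hypothetical too.

```python
import hashlib
import struct


def vertex_id(name):
    """Map a name to a signed 64-bit integer (illustrative hash, see above)."""
    digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()
    return struct.unpack(">q", digest)[0]


def triples_to_graph(triples):
    """Turn (entity, attribute, value) triples into vertex and edge rows."""
    vertices = {}  # id -> name; the dict deduplicates repeated names
    edges = []     # (source id, destination id, edge name)
    for entity, attribute, value in triples:
        src, dst = vertex_id(entity), vertex_id(value)
        vertices[src] = entity
        vertices[dst] = value
        edges.append((src, dst, attribute))
    return vertices, edges
```

Because the ID depends only on the text, a value that is identical to its entity name maps to the same vertex and yields a self-loop. That is presumably why the `中文名称` row in the edge.csv sample has the same ID on both sides.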
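Data format conversion
- The cleaning tool used above was written for Nebula Graph, so its output still has to be converted into a format that neo4j can import. Reference: ownthink format conversion (ownthink格式转换)
Python format conversion code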
import csv


def prep_vertex_all():
    # Replace these paths with your own.
    frname = "kg/kg-clean/vertex.csv"
    fwname = "kg/kg-clean/vertex_output_vertex_all.csv"
    fename = "kg/kg-clean/err_vertex.csv"
    with open(frname, "r", encoding="utf-8") as fr, \
            open(fwname, "w", encoding="utf-8", newline="") as fw, \
            open(fename, "w", encoding="utf-8") as ferror:
        writer = csv.writer(fw)
        writer.writerow([":ID", "name", ":LABEL"])
        for line in fr:
            line = line.strip()
            if not line:
                continue
            try:
                # Parse with the csv module so that quoted names containing
                # commas stay in one field (line.split(",") cut them short);
                # the writer quotes such names again on output.
                spo = next(csv.reader([line]))
                writer.writerow([spo[0], spo[1].replace('"', ''), "ENTITY"])
            except (csv.Error, IndexError):
                # Record lines that cannot be parsed and carry on.
                ferror.write("{}\n".format(line))


def prep_edge_all():
    # Replace these paths with your own.
    frname = "kg/kg-clean/edge.csv"
    fwname = "kg/kg-clean/edge_output_all.csv"
    fename = "kg/kg-clean/err_edge.csv"
    print(frname)
    print(fwname)
    with open(frname, "r", encoding="utf-8") as fr, \
            open(fwname, "w", encoding="utf-8", newline="") as fw, \
            open(fename, "w", encoding="utf-8") as ferror:
        writer = csv.writer(fw)
        writer.writerow([":START_ID", "name", ":END_ID", ":TYPE"])
        for line in fr:
            line = line.strip()
            if not line:
                continue
            try:
                # Input columns are: source id, destination id, edge name.
                spo = next(csv.reader([line]))
                writer.writerow([spo[0], spo[2].replace('"', ''), spo[1], "RELATIONSHIP"])
            except (csv.Error, IndexError):
                ferror.write("{}\n".format(line))


if __name__ == '__main__':
    prep_vertex_all()
    prep_edge_all()
CSV files after format conversion
Entities
:ID,name,:LABEL
-201035082963479683,实体,ENTITY
-1779678833482502384,值,ENTITY
4646408208538057683,胶饴,ENTITY
-1861609733419239066,别名: 饴糖、畅糖、畅、软糖。,ENTITY
-2047289935702608120,词条,ENTITY
5842706712819643509,词条(拼音:cí tiáo)也叫词目,是辞书学用语,指收列的词语及其释文。,ENTITY
-3063129772935425027,文化,ENTITY
-2484942249444426630,红色食品,ENTITY
-3877061284769534378,红色食品是指食品为红色、橙红色或棕红色的食品。,ENTITY
-3402450096279275143,否,ENTITY
4786182067583989997,预防感冒,缓解疲劳,ENTITY
-8978611301755314833,全部人群,ENTITY
-382812815618074210,增强表皮细胞再生和防止皮肤衰老,ENTITY
3455734391170888430,非科学,ENTITY
-4368442157131186527,生活,ENTITY
-4016848910133347272,大龙湫,ENTITY
-1751058806841876591,雁荡山景区分散,东起羊角洞,西至锯板岭;南起筋竹溪,北至六坪山。,ENTITY
-4369745808943528904,big dragon autrum,ENTITY
-3278556255913778158,浙江省温州市雁荡山景区,ENTITY
Relationships
:START_ID,name,:END_ID,:TYPE
-201035082963479683,属性,-1779678833482502384,RELATIONSHIP
4646408208538057683,描述,-1861609733419239066,RELATIONSHIP
-2047289935702608120,描述,5842706712819643509,RELATIONSHIP
-2047289935702608120,标签,-3063129772935425027,RELATIONSHIP
-2484942249444426630,描述,-3877061284769534378,RELATIONSHIP
-2484942249444426630,中文名,-2484942249444426630,RELATIONSHIP
-2484942249444426630,是否含防腐剂,-3402450096279275143,RELATIONSHIP
-2484942249444426630,主要食用功效,4786182067583989997,RELATIONSHIP
-2484942249444426630,适宜人群,-8978611301755314833,RELATIONSHIP
-2484942249444426630,用途,-382812815618074210,RELATIONSHIP
-2484942249444426630,标签,3455734391170888430,RELATIONSHIP
-2484942249444426630,标签,-4368442157131186527,RELATIONSHIP
-4016848910133347272,描述,-1751058806841876591,RELATIONSHIP
-4016848910133347272,中文名称,-4016848910133347272,RELATIONSHIP
-4016848910133347272,外文名称,-4369745808943528904,RELATIONSHIP
-4016848910133347272,地理位置,-3278556255913778158,RELATIONSHIP
-4016848910133347272,开放时间,-1081363081064284954,RELATIONSHIP
-4016848910133347272,门票价格,3797530799472559859,RELATIONSHIP
-4016848910133347272,著名景点,6249183780323029504,RELATIONSHIP
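Before handing files like these to the importer, it can be worth checking that every row has as many fields as its header. In the entity sample, for instance, the row for 预防感冒,缓解疲劳 splits into four fields under a three-column header, because the name contains an unquoted comma. The sketch below uses Python's standard `csv` module; `find_bad_rows` and `quote_row` are hypothetical helper names, not part of the pipeline in this post.

```python
import csv
import io


def find_bad_rows(lines, expected_columns):
    """Yield (line_number, row) for rows whose field count is unexpected."""
    reader = csv.reader(lines)
    for row in reader:
        if len(row) != expected_columns:
            yield reader.line_num, row


def quote_row(fields):
    """Serialise one row; fields that contain commas are wrapped in quotes."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue()
```

`find_bad_rows(open(path, encoding="utf-8", newline=""), 3)` can be run over the entity file (use 4 for the relationship file), and `quote_row` shows what a well-formed version of a flagged row would look like.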
Importing the data into neo4j
Use the neo4j-admin import command to do the import
./bin/neo4j-admin import \
  --database=graph.db \
  --mode=csv \
  --nodes /Users/shengl/24-xiaoling/1600-data/01-Archived/01-cleandata/rdf-converter/vertex_output_vertex_all.csv \
  --relationships /Users/shengl/24-xiaoling/1600-data/01-Archived/01-cleandata/rdf-converter/edge_output_all.csv \
  --ignore-duplicate-nodes=true \
  --ignore-missing-nodes=true \
  --id-type=string
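If you rerun the import often, it may be convenient to assemble the command in Python instead of retyping the long paths. The sketch below only builds the argument list with the options used in this post (neo4j 3.5.12); later neo4j releases changed this command's options, so check the documentation for your version. The function names are hypothetical.

```python
import shlex


def build_import_command(nodes_csv, relationships_csv, database="graph.db"):
    """Return the neo4j-admin import argument list used in this post."""
    return [
        "./bin/neo4j-admin", "import",
        "--database=" + database,
        "--mode=csv",
        "--nodes", nodes_csv,
        "--relationships", relationships_csv,
        "--ignore-duplicate-nodes=true",
        "--ignore-missing-nodes=true",
        "--id-type=string",
    ]


def as_shell(command):
    """Render an argument list as a single copy-pasteable shell line."""
    return " ".join(shlex.quote(part) for part in command)
```

The list can be passed to `subprocess.run(command, check=True)` from the neo4j home directory, or printed with `as_shell` to inspect it first.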
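The import ran as follows
- Tip: the import appeared to hang several times along the way, and each time I started over. Given how large the dataset is and how much memory it needs, though, you can simply wait patiently for it to finish.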
Neo4j version: 3.5.12
Importing the contents of these files into /Users/shengl/2-sys-ai/neo4j-3.5.12/data/databases/graph.db:
Nodes:
  /Users/shengl/24-xiaoling/1600-data/01-Archived/01-cleandata/rdf-converter/vertex_output_vertex_all.csv
Relationships:
  /Users/shengl/24-xiaoling/1600-data/01-Archived/01-cleandata/rdf-converter/edge_output_all.csv
Available resources:
  Total machine memory: 8.00 GB
  Free machine memory: 443.01 MB
  Max heap memory : 1.78 GB
  Processors: 4
  Configured max memory: 5.60 GB
  High-IO: true
WARNING: 300.88 MB memory may not be sufficient to complete this import. Suggested memory distribution is:
heap size: 1.02 GB
minimum free and available memory excluding heap size: 2.52 GB
Import starting 2020-06-21 00:56:14.222+0800
  Estimated number of nodes: 124.08 M
  Estimated number of node properties: 124.04 M
  Estimated number of relationships: 153.90 M
  Estimated number of relationship properties: 153.88 M
  Estimated disk space usage: 23.27 GB
  Estimated required memory usage: 2.52 GB
InteractiveReporterInteractions command list (end with ENTER):
  c: Print more detailed information about current stage
  i: Print more detailed information
(1/4) Node import 2020-06-21 00:56:14.318+0800
  Estimated number of nodes: 124.08 M
  Estimated disk space usage: 12.52 GB
  Estimated required memory usage: 2.52 GB
.......... .......... .......... .......... .......... 5% ∆59s 400ms
.......... .......... .......... .......... .......... 10% ∆44s 67ms
.......... .......... .......... .......... .......... 15% ∆40s 843ms
.......... .......... .......... .......... .......-.. 20% ∆1s 846ms
.......... .......... .......... .......... .......... 25% ∆0ms
.......... .......... .......... .......... .......... 30% ∆21s 900ms
.......... .......... .......... .......... .......... 35% ∆24s 483ms
.......... .......... .......... .......... .......... 40% ∆21s 689ms
.......... .......... .......... .......... .......... 45% ∆3s 624ms
.......... .......... .......... .......... .......... 50% ∆9s 70ms
.......... .......... .......... .......... .......... 55% ∆19s 469ms
.......... .......... .......... .......... .......... 60% ∆22s 82ms
.......... .......... .......... .......... .......... 65% ∆26s 136ms
.......... .......... .......... .......... .......... 70% ∆43s 238ms
.......... .......... .......... .......... .......... 75% ∆1m 21s 497ms
.......... ........ -. .......... .......... .......... 80% ∆12m 31s 254ms
.......... .......... .......... .......... .......... 85% ∆8ms
.......... .......... .......... .......... .......... 90% ∆0ms
.......... .......... .......... .......... .......... 95% ∆0ms
.......... .......... .......... .......... .........
(2/4) Relationship import 2020-06-21 01:16:45.024+0800
  Estimated number of relationships: 153.90 M
  Estimated disk space usage: 10.75 GB
  Estimated required memory usage: 3.19 GB
.......... .......... .......... .......... .......... 5% ∆3m 13s 542ms
.......... .......... .......... .......... .......... 10% ∆3m 11s 870ms
.......... .......... .......... .......... .......... 15% ∆3m 19s 289ms
.......... .......... .......... .......... .......... 20% ∆3m 9s 639ms
.......... .......... .......... .......... .......... 25% ∆3m 950ms
.......... .......... .......... .......... .......... 30% ∆2m 55s 403ms
.......... .......... .......... .......... .......... 35% ∆2m 59s 21ms
.......... .......... .......... .......... .......... 40% ∆2m 32s 129ms
.......... .......... .......... .......... .......... 45% ∆2m 34s 393ms
.......... .......... .......... .......... .......... 50% ∆2m 41s 458ms
.......... .......... .......... .......... .......... 55% ∆2m 52s 905ms
.......... .......... .......... .......... .......... 60% ∆2m 8s 285ms
.......... .......... .......... .......... .......... 65% ∆2m 9s 676ms
.......... .......... .......... .......... .......... 70% ∆2m 10s 811ms
.......... .......... .......... .......... .......... 75% ∆2m 2s 33ms
.......... .......... .......... .......... .......... 80% ∆3m 55s 170ms
.......... .......... .......... .......... .......... 85% ∆3m 31s 936ms
.......... .......... .......... .......... .......... 90% ∆3m 18s 954ms
.......... .......... .......... .......... .......... 95% ∆51s 591ms
.......... .......... .......... .......... .......... 100% ∆1ms
(3/4) Relationship linking 2020-06-21 02:09:24.090+0800
  Estimated required memory usage: 1.46 GB
.......... .......... .......... .......... .......... 5% ∆20s 682ms
.......... .......... .......... .......... .......... 10% ∆30s 169ms
.......... .......... .......... .......... .......... 15% ∆29s 796ms
.......... .......... .......... .......... .........- 20% ∆202ms
.......... .......... .......... .......... .......... 25% ∆6s 441ms
.......... .......... .......... .......... .......... 30% ∆6s 818ms
.......... .......... .......... .......... .......... 35% ∆12s 849ms
.......... .......... .......... .......... .......... 40% ∆5s 14ms
.......... .......... .......... .......... .......... 45% ∆16s 472ms
.......... .......... .......... .......... .......... 50% ∆14s 857ms
.......... .......... .......... .......... .......... 55% ∆24s 711ms
.......... .......... .......... .......... ........-. 60% ∆331ms
.......... .......... .......... .......... .......... 65% ∆5s 810ms
.......... .......... .......... .......... .......... 70% ∆7s 29ms
.......... .......... .......... .......... .......... 75% ∆15s 47ms
.......... .......... .......... .......... .......... 80% ∆7s 229ms
.......... .......... .......... .......... .......... 85% ∆20s 888ms
.......... .......... .......... .......... .......... 90% ∆26s 123ms
.......... .......... .......... .......... .......... 95% ∆15s 461ms
.......... .......... .......... .......... .......... 100% ∆23s 694ms
(4/4) Post processing 2020-06-21 02:15:32.752+0800
  Estimated required memory usage: 1020.01 MB
-......... .......... .......... .......... .......... 5% ∆5s 435ms
.......... .......... .......... .......... .......... 10% ∆2s 612ms
.......... .......... .......... .......... .......... 15% ∆3s 407ms
.......... .......... .......... .......... .......... 20% ∆2s 9ms
.......... .......... .......... .......... .......... 25% ∆2s 10ms
.......... .......... .......... .......... .......... 30% ∆1s 423ms
.......... .......... .......... .......... .......... 35% ∆1s 609ms
.......... .......... .......... .......... ......-... 40% ∆349ms
.......... .......... .......... .......... .......... 45% ∆1s 809ms
.......... .......... .......... .......... .......... 50% ∆2s 5ms
.......... .......... .......... .......... .......... 55% ∆4s 214ms
.......... .......... .......... .......... .......... 60% ∆2s 207ms
.......... .......... .......... .......... .......... 65% ∆6s 821ms
.......... .......... .......... .......... .......... 70% ∆4s 7ms
.......... .......... .......... .......... .......... 75% ∆2s 612ms
.......... .......... .......... .......... .......... 80% ∆2s 610ms
.......... .......... .......... .......... .......... 85% ∆4s 8ms
.......... .......... .......... .......... .......... 90% ∆2s 2ms
.......... .......... .......... .......... .......... 95% ∆2s 804ms
.......... .......... .......... .......... .......... 100% ∆1s 807ms
IMPORT DONE in 1h 20m 44s 293ms.
Imported:
  45464809 nodes
  139951301 relationships
  185317766 properties
Peak memory usage: 3.19 GB
There were bad entries which were skipped and logged into /Users/shengl/2-sys-ai/neo4j-3.5.12/import.report
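The summary at the end of the log is the quickest sanity check: about 45.46 million nodes and 139.95 million relationships were imported, in line with the roughly 40 million entities and 140 million edges expected from the conversion step. If you save the console output to a file, a small regular expression can pull these numbers out. This is a sketch that assumes the wording of the summary shown above; `parse_import_summary` is a hypothetical helper name.

```python
import re

# Matches the summary whether it is on one line or spread over several.
SUMMARY_PATTERN = re.compile(
    r"Imported:\s+(\d+) nodes\s+(\d+) relationships\s+(\d+) properties"
)


def parse_import_summary(log_text):
    """Extract node, relationship and property counts from the import log."""
    match = SUMMARY_PATTERN.search(log_text)
    if match is None:
        return None
    nodes, relationships, properties = (int(g) for g in match.groups())
    return {
        "nodes": nodes,
        "relationships": relationships,
        "properties": properties,
    }
```

The function returns `None` when no summary is found, which is what you would see for an import that was interrupted before finishing.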
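To be improved
- The final result is the one shown at the top of the article.
- Identifying attribute values (values that are not entities, such as a height of 175) will probably need a dedicated data-processing program.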
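As a starting point for such a program, a value-based heuristic could flag values that look like numbers, prices, opening hours or yes/no answers. The patterns below are assumptions chosen to fit values seen in the samples (such as 50元, 08:00~18:00 and 否); they are far from complete, and `looks_like_literal` is a hypothetical name.

```python
import re

# Hypothetical patterns for values that look like literals rather than entities.
LITERAL_PATTERNS = [
    re.compile(r"^\d+(\.\d+)?\s*(cm|kg|元|米|岁)?$"),        # 175, 50元
    re.compile(r"^\d{1,2}:\d{2}\s*[~-]\s*\d{1,2}:\d{2}$"),  # 08:00~18:00
    re.compile(r"^(是|否)$"),                                # yes / no
]


def looks_like_literal(value):
    """True if the value matches one of the literal patterns above."""
    text = value.strip()
    return any(pattern.match(text) for pattern in LITERAL_PATTERNS)
```

Values for which this returns `True` could be stored as node properties, while everything else stays a vertex as in the current import.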
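Copyright notice
Title: 101 - Building the ownthink knowledge graph with neo4j (101-使用neo4j构建ownthink知识图谱)
Author: Sheng Ling (盛领)
Published: 2020-06-21, 11:47:57
Original link: http://blog.xiaoyuyu.net/post/294dd39b.html
License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). Please keep the original link and the author when reposting.
For commercial cooperation or licensing enquiries, please leave me a message: sunsetxiao@126.com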
