小皮博客 | Xiaopi's Blog

101 - Building the ownthink Knowledge Graph with neo4j

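ownthink has open-sourced its knowledge graph data, which makes a good base data set for building a knowledge graph. I chose neo4j as the data store for the graph.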

Result

  • The data import described later takes a very long time, so here is a screenshot of the result first.

MATCH (n) RETURN n LIMIT 2000

* Zhuge Liang (诸葛亮)'s gender is male
* What else can you spot?

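Environment setup

  • Install neo4j
  • Install python3 (you could process the data with shell instead, but better not)
  • Install go (needed for the data processing tool; a ready-made one exists, or you can write your own in python)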

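Data preparation

Download the data from Baidu Netdisk or the official site: https://www.ownthink.com/docs/kg/#_1

Unzip it to get the CSV file:
wc -l ownthink_v2.csv  # shows how many lines the file contains

  • The zip file should be about 1.95 GB
  • The extracted CSV file is a little over 8 GB

After unzipping and processing, the first 20 lines look like this: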

实体,属性,值
胶饴,描述,别名: 饴糖、畅糖、畅、软糖。
词条,描述,词条(拼音:cí tiáo)也叫词目,是辞书学用语,指收列的词语及其释文。
词条,标签,文化
红色食品,描述,红色食品是指食品为红色、橙红色或棕红色的食品。
红色食品,中文名,红色食品
红色食品,是否含防腐剂,否
红色食品,主要食用功效,预防感冒,缓解疲劳
红色食品,适宜人群,全部人群
红色食品,用途,增强表皮细胞再生和防止皮肤衰老
红色食品,标签,非科学
红色食品,标签,生活
大龙湫,描述,雁荡山景区分散,东起羊角洞,西至锯板岭;南起筋竹溪,北至六坪山。
大龙湫,中文名称,大龙湫
大龙湫,外文名称,big dragon autrum
大龙湫,地理位置,浙江省温州市雁荡山景区
大龙湫,开放时间,08:00~18:00
大龙湫,门票价格,50元
大龙湫,著名景点,芙蓉峰
大龙湫,著名景点,剪刀峰
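
In the sample above, some values contain commas themselves (for example the 主要食用功效 row), so a plain split on every comma would cut them apart. Here is a minimal sketch of reading the triples while keeping the value intact. It assumes that the entity and attribute fields never contain a comma; the names `parse_triple` and `read_triples` are made up for this illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]


def parse_triple(line: str) -> Optional[Triple]:
    """Split one line into (entity, attribute, value).

    Only the first two commas act as separators, so commas inside
    the value are preserved.
    """
    parts = line.rstrip("\n").split(",", 2)
    if len(parts) != 3:
        return None  # malformed line: fewer than three fields
    return parts[0], parts[1], parts[2]


def read_triples(lines: Iterable[str]) -> Iterator[Triple]:
    """Yield triples from an iterable of lines, skipping the header and bad rows."""
    for i, line in enumerate(lines):
        if i == 0 and line.startswith("实体,属性,值"):
            continue  # header row
        triple = parse_triple(line)
        if triple is not None:
            yield triple


sample = [
    "实体,属性,值\n",
    "红色食品,主要食用功效,预防感冒,缓解疲劳\n",
    "大龙湫,门票价格,50元\n",
]
triples = list(read_triples(sample))
for t in triples:
    print(t)
```

For the real file, an open file object (opened with encoding="utf-8") can be passed to `read_triples` in place of the list, so the 8 GB file is streamed line by line rather than loaded into memory.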

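Data processing & cleaning

Notes

As the data above shows, the knowledge graph data consists of triples: entity, attribute, value (a value can also be treated as an entity). In graph terms, each triple should be cleaned into two vertices connected by one edge.
Going a step further, we should also work out which items belong as properties and which as relationships.
For example, Yao Ming's gender "男" (male) should not become an entity, but his wife "叶莉" (Ye Li) should.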
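
One way to make that split, sketched here as an assumption rather than anything this post implements: keep a value as a relationship target only if it also shows up as an entity (the first column) somewhere in the data, and treat everything else as a plain property. The function name and the sample rows are hypothetical, and the heuristic is imperfect, since a common word such as "男" may well have its own entry in the full data set.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def split_properties_and_relations(triples: Iterable[Triple]):
    """Separate triples into node properties and entity-to-entity relations.

    Heuristic (an assumption): a value counts as an entity only when it
    also occurs as a subject; self-references stay as properties.
    """
    triples = list(triples)
    subjects: Set[str] = {s for s, _, _ in triples}
    properties: Dict[str, Dict[str, List[str]]] = {}
    relations: List[Triple] = []
    for s, p, o in triples:
        if o in subjects and o != s:
            relations.append((s, p, o))
        else:
            properties.setdefault(s, {}).setdefault(p, []).append(o)
    return properties, relations


# Hypothetical sample rows, for illustration only.
sample = [
    ("姚明", "性别", "男"),
    ("姚明", "妻子", "叶莉"),
    ("叶莉", "性别", "女"),
    ("红色食品", "中文名", "红色食品"),
]
properties, relations = split_properties_and_relations(sample)
print(relations)           # [('姚明', '妻子', '叶莉')]
print(properties["姚明"])  # {'性别': ['男']}
```

This version holds every triple in memory, which is fine for a sample but not for the full file; there, a first pass that only collects the subjects followed by a second streaming pass would be the more realistic shape.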

Data structure conversion

go build

  • Process the data

./rdf-converter --path /path/to/ownthink_v2.csv

  • The output should contain roughly 40 million vertices and roughly 140 million edges.
  • vertex.csv

    -2469395383949115281,过度包装
    -5567206714840433083,Over  Package
    3836323934884101628,有的商品故意增加包装层数
    1185893106173039861,很多采用实木、金属制品
    3455734391170888430,非科学
    9183164258636124946,教育
    5258679239570815125,成熟市场
    -8062106589304861485,"成熟市场是指低增长率,高占有率的市场。"
    
  • edge.csv

    3413383836870836248,-948987595135324087,含义
    3413383836870836248,8037179844375033188,定义
    3413383836870836248,-2559124418148243756,标签
    3413383836870836248,8108596883039039864,标签
    2587975790775251569,-4666568475926279810,描述
    2587975790775251569,2587975790775251569,中文名称
    2587975790775251569,3771551033890875715,外文名称
    2587975790775251569,2900555761857775043,地理位置
    2587975790775251569,-1913521037799946160,占地面积
    2587975790775251569,-1374607753051283066,开放时间
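
To make the two file layouts concrete, here is a small Python sketch that turns triples into rows shaped like vertex.csv (id, name) and edge.csv (source id, destination id, attribute). The ID scheme is an assumption: it hashes the text with BLAKE2b into a signed 64-bit integer, which is not necessarily the hash the Go tool uses, so the numbers will not match the samples above. `stable_id` and `to_vertices_and_edges` are illustrative names.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def stable_id(text: str) -> int:
    """Map a string to a signed 64-bit integer (assumed scheme, not the tool's)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)


def to_vertices_and_edges(triples: Iterable[Tuple[str, str, str]]):
    """Build a vertex table {id: name} and an edge list [(src, dst, attribute)]."""
    vertices: Dict[int, str] = {}
    edges: List[Tuple[int, int, str]] = []
    for entity, attribute, value in triples:
        src, dst = stable_id(entity), stable_id(value)
        vertices.setdefault(src, entity)  # the same text always gets the same id
        vertices.setdefault(dst, value)
        edges.append((src, dst, attribute))
    return vertices, edges


sample = [
    ("大龙湫", "中文名称", "大龙湫"),
    ("大龙湫", "门票价格", "50元"),
]
vertices, edges = to_vertices_and_edges(sample)
for vid, name in vertices.items():
    print("{},{}".format(vid, name))
for src, dst, attribute in edges:
    print("{},{},{}".format(src, dst, attribute))
```

Because the ID depends only on the text, an entity and an identical value collapse into one vertex, which is why a row such as 中文名称 ends up as an edge from a vertex to itself, just like in the edge.csv sample.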
    

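Data format conversion

  • The cleaning tool used above was written with Nebula Graph in mind, so some extra work is needed to turn its output into a format that neo4j can import. Reference: ownthink format conversion

Python format conversion code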

def prep_vertex_all():
    # Replace these paths with your own.
    frname = "kg/kg-clean/vertex.csv"
    fwname = "kg/kg-clean/vertex_output_vertex_all.csv"
    fename = "kg/kg-clean/err_vertex.csv"
    with open(frname, 'r', encoding='utf-8') as fr, \
            open(fwname, 'w', encoding='utf-8') as fw, \
            open(fename, 'w', encoding='utf-8') as ferror:
        # Header row in the neo4j-admin import format for nodes.
        fw.write("{},{},{}\n".format(":ID", "name", ":LABEL"))
        for line in fr:
            line = line.strip()
            if not line:
                continue
            try:
                # Input row: <id>,<name>. Note that a plain split(",") keeps
                # only the part of the name before its first comma.
                spo = line.split(",")
                fw.write("{},{},{}\n".format(spo[0], spo[1].replace('"', ''), "ENTITY"))
            except IndexError:
                # Rows with too few fields are logged to the error file.
                ferror.write("{}\n".format(line))


def prep_edge_all():
    # Replace these paths with your own.
    frname = "kg/kg-clean/edge.csv"
    fwname = "kg/kg-clean/edge_output_all.csv"
    fename = "kg/kg-clean/err_edge.csv"
    print(frname)
    print(fwname)
    with open(frname, 'r', encoding='utf-8') as fr, \
            open(fwname, 'w', encoding='utf-8') as fw, \
            open(fename, 'w', encoding='utf-8') as ferror:
        # Header row in the neo4j-admin import format for relationships.
        fw.write("{},{},{},{}\n".format(":START_ID", "name", ":END_ID", ":TYPE"))
        for line in fr:
            line = line.strip()
            if not line:
                continue
            try:
                # Input row: <start id>,<end id>,<attribute>
                spo = line.split(",")
                fw.write("{},{},{},{}\n".format(spo[0], spo[2].replace('"', ''), spo[1], "RELATIONSHIP"))
            except IndexError:
                # Rows with too few fields are logged to the error file.
                ferror.write("{}\n".format(line))


if __name__ == '__main__':
    prep_vertex_all()
    prep_edge_all()
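
The script above splits each line on every comma, so a quoted name that contains a comma (vertex.csv has rows like that) loses everything after its first comma. As a possible alternative, here is a sketch of the vertex step built on Python's csv module, which parses quoted fields and re-quotes them on output. `convert_vertices` is a hypothetical helper, not part of the original script, and it works on open file objects so it can be tried on in-memory data.

```python
import csv
import io


def convert_vertices(src, dst, err):
    """Convert `<id>,<name>` rows into node rows for neo4j-admin import.

    src, dst and err are open text file objects.
    Returns (rows_written, rows_rejected).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow([":ID", "name", ":LABEL"])
    written = rejected = 0
    for row in reader:
        if not row:
            continue  # blank line
        if len(row) != 2:
            err.write(",".join(row) + "\n")  # keep bad rows for inspection
            rejected += 1
            continue
        writer.writerow([row[0], row[1], "ENTITY"])
        written += 1
    return written, rejected


source = io.StringIO('1,过度包装\n2,"成熟市场是指低增长率,高占有率的市场。"\n3\n')
out, errors = io.StringIO(), io.StringIO()
counts = convert_vertices(source, out, errors)
print(out.getvalue())
print(counts)
```

With real files, open them with encoding="utf-8" and newline="" as the csv module documentation recommends. The quoted output relies on the import tool accepting double-quoted fields, which is worth confirming on your neo4j version before a long import.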

CSV files after format conversion

  • Entities

    :ID,name,:LABEL
    -201035082963479683,实体,ENTITY
    -1779678833482502384,值,ENTITY
    4646408208538057683,胶饴,ENTITY
    -1861609733419239066,别名: 饴糖、畅糖、畅、软糖。,ENTITY
    -2047289935702608120,词条,ENTITY
    5842706712819643509,词条(拼音:cí tiáo)也叫词目,是辞书学用语,指收列的词语及其释文。,ENTITY
    -3063129772935425027,文化,ENTITY
    -2484942249444426630,红色食品,ENTITY
    -3877061284769534378,红色食品是指食品为红色、橙红色或棕红色的食品。,ENTITY
    -3402450096279275143,否,ENTITY
    4786182067583989997,预防感冒,缓解疲劳,ENTITY
    -8978611301755314833,全部人群,ENTITY
    -382812815618074210,增强表皮细胞再生和防止皮肤衰老,ENTITY
    3455734391170888430,非科学,ENTITY
    -4368442157131186527,生活,ENTITY
    -4016848910133347272,大龙湫,ENTITY
    -1751058806841876591,雁荡山景区分散,东起羊角洞,西至锯板岭;南起筋竹溪,北至六坪山。,ENTITY
    -4369745808943528904,big dragon autrum,ENTITY
    -3278556255913778158,浙江省温州市雁荡山景区,ENTITY
    
  • Relationships

    :START_ID,name,:END_ID,:TYPE
    -201035082963479683,属性,-1779678833482502384,RELATIONSHIP
    4646408208538057683,描述,-1861609733419239066,RELATIONSHIP
    -2047289935702608120,描述,5842706712819643509,RELATIONSHIP
    -2047289935702608120,标签,-3063129772935425027,RELATIONSHIP
    -2484942249444426630,描述,-3877061284769534378,RELATIONSHIP
    -2484942249444426630,中文名,-2484942249444426630,RELATIONSHIP
    -2484942249444426630,是否含防腐剂,-3402450096279275143,RELATIONSHIP
    -2484942249444426630,主要食用功效,4786182067583989997,RELATIONSHIP
    -2484942249444426630,适宜人群,-8978611301755314833,RELATIONSHIP
    -2484942249444426630,用途,-382812815618074210,RELATIONSHIP
    -2484942249444426630,标签,3455734391170888430,RELATIONSHIP
    -2484942249444426630,标签,-4368442157131186527,RELATIONSHIP
    -4016848910133347272,描述,-1751058806841876591,RELATIONSHIP
    -4016848910133347272,中文名称,-4016848910133347272,RELATIONSHIP
    -4016848910133347272,外文名称,-4369745808943528904,RELATIONSHIP
    -4016848910133347272,地理位置,-3278556255913778158,RELATIONSHIP
    -4016848910133347272,开放时间,-1081363081064284954,RELATIONSHIP
    -4016848910133347272,门票价格,3797530799472559859,RELATIONSHIP
    -4016848910133347272,著名景点,6249183780323029504,RELATIONSHIP
    

Importing the data into neo4j

  • Use the neo4j-admin import command to load the data

    ./bin/neo4j-admin import --database=graph.db --mode=csv --nodes /Users/shengl/24-xiaoling/1600-data/01-Archived/01-cleandata/rdf-converter/vertex_output_vertex_all.csv --relationships /Users/shengl/24-xiaoling/1600-data/01-Archived/01-cleandata/rdf-converter/edge_output_all.csv --ignore-duplicate-nodes=true --ignore-missing-nodes=true --id-type=string

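  • The import ran as shown below

  • Tip: the import looked frozen several times along the way, and each time I started it over. Given how large the data set is and how much memory it takes, though, you can simply wait patiently for it to finish.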
Neo4j version: 3.5.12
Importing the contents of these files into /Users/shengl/2-sys-ai/neo4j-3.5.12/data/databases/graph.db:
Nodes:
  /Users/shengl/24-xiaoling/1600-data/01-Archived/01-cleandata/rdf-converter/vertex_output_vertex_all.csv
Relationships:
  /Users/shengl/24-xiaoling/1600-data/01-Archived/01-cleandata/rdf-converter/edge_output_all.csv

Available resources:
  Total machine memory: 8.00 GB
  Free machine memory: 443.01 MB
  Max heap memory : 1.78 GB
  Processors: 4
  Configured max memory: 5.60 GB
  High-IO: true

WARNING: 300.88 MB memory may not be sufficient to complete this import. Suggested memory distribution is:
heap size: 1.02 GB
minimum free and available memory excluding heap size: 2.52 GB
Import starting 2020-06-21 00:56:14.222+0800
  Estimated number of nodes: 124.08 M
  Estimated number of node properties: 124.04 M
  Estimated number of relationships: 153.90 M
  Estimated number of relationship properties: 153.88 M
  Estimated disk space usage: 23.27 GB
  Estimated required memory usage: 2.52 GB

InteractiveReporterInteractions command list (end with ENTER):
  c: Print more detailed information about current stage
  i: Print more detailed information

(1/4) Node import 2020-06-21 00:56:14.318+0800
  Estimated number of nodes: 124.08 M
  Estimated disk space usage: 12.52 GB
  Estimated required memory usage: 2.52 GB
.......... .......... .......... .......... ..........   5% ∆59s 400ms
.......... .......... .......... .......... ..........  10% ∆44s 67ms
.......... .......... .......... .......... ..........  15% ∆40s 843ms
.......... .......... .......... .......... .......-..  20% ∆1s 846ms
.......... .......... .......... .......... ..........  25% ∆0ms
.......... .......... .......... .......... ..........  30% ∆21s 900ms
.......... .......... .......... .......... ..........  35% ∆24s 483ms
.......... .......... .......... .......... ..........  40% ∆21s 689ms
.......... .......... .......... .......... ..........  45% ∆3s 624ms
.......... .......... .......... .......... ..........  50% ∆9s 70ms
.......... .......... .......... .......... ..........  55% ∆19s 469ms
.......... .......... .......... .......... ..........  60% ∆22s 82ms
.......... .......... .......... .......... ..........  65% ∆26s 136ms
.......... .......... .......... .......... ..........  70% ∆43s 238ms
.......... .......... .......... .......... ..........  75% ∆1m 21s 497ms
.......... ........-. .......... .......... ..........  80% ∆12m 31s 254ms
.......... .......... .......... .......... ..........  85% ∆8ms
.......... .......... .......... .......... ..........  90% ∆0ms
.......... .......... .......... .......... ..........  95% ∆0ms
.......... .......... .......... .......... .........

(2/4) Relationship import 2020-06-21 01:16:45.024+0800
  Estimated number of relationships: 153.90 M
  Estimated disk space usage: 10.75 GB
  Estimated required memory usage: 3.19 GB
.......... .......... .......... .......... ..........   5% ∆3m 13s 542ms
.......... .......... .......... .......... ..........  10% ∆3m 11s 870ms
.......... .......... .......... .......... ..........  15% ∆3m 19s 289ms
.......... .......... .......... .......... ..........  20% ∆3m 9s 639ms
.......... .......... .......... .......... ..........  25% ∆3m 950ms
.......... .......... .......... .......... ..........  30% ∆2m 55s 403ms
.......... .......... .......... .......... ..........  35% ∆2m 59s 21ms
.......... .......... .......... .......... ..........  40% ∆2m 32s 129ms
.......... .......... .......... .......... ..........  45% ∆2m 34s 393ms
.......... .......... .......... .......... ..........  50% ∆2m 41s 458ms
.......... .......... .......... .......... ..........  55% ∆2m 52s 905ms
.......... .......... .......... .......... ..........  60% ∆2m 8s 285ms
.......... .......... .......... .......... ..........  65% ∆2m 9s 676ms
.......... .......... .......... .......... ..........  70% ∆2m 10s 811ms
.......... .......... .......... .......... ..........  75% ∆2m 2s 33ms
.......... .......... .......... .......... ..........  80% ∆3m 55s 170ms
.......... .......... .......... .......... ..........  85% ∆3m 31s 936ms
.......... .......... .......... .......... ..........  90% ∆3m 18s 954ms
.......... .......... .......... .......... ..........  95% ∆51s 591ms
.......... .......... .......... .......... .......... 100% ∆1ms

(3/4) Relationship linking 2020-06-21 02:09:24.090+0800
  Estimated required memory usage: 1.46 GB
.......... .......... .......... .......... ..........   5% ∆20s 682ms
.......... .......... .......... .......... ..........  10% ∆30s 169ms
.......... .......... .......... .......... ..........  15% ∆29s 796ms
.......... .......... .......... .......... .........-  20% ∆202ms
.......... .......... .......... .......... ..........  25% ∆6s 441ms
.......... .......... .......... .......... ..........  30% ∆6s 818ms
.......... .......... .......... .......... ..........  35% ∆12s 849ms
.......... .......... .......... .......... ..........  40% ∆5s 14ms
.......... .......... .......... .......... ..........  45% ∆16s 472ms
.......... .......... .......... .......... ..........  50% ∆14s 857ms
.......... .......... .......... .......... ..........  55% ∆24s 711ms
.......... .......... .......... .......... ........-.  60% ∆331ms
.......... .......... .......... .......... ..........  65% ∆5s 810ms
.......... .......... .......... .......... ..........  70% ∆7s 29ms
.......... .......... .......... .......... ..........  75% ∆15s 47ms
.......... .......... .......... .......... ..........  80% ∆7s 229ms
.......... .......... .......... .......... ..........  85% ∆20s 888ms
.......... .......... .......... .......... ..........  90% ∆26s 123ms
.......... .......... .......... .......... ..........  95% ∆15s 461ms
.......... .......... .......... .......... .......... 100% ∆23s 694ms

(4/4) Post processing 2020-06-21 02:15:32.752+0800
  Estimated required memory usage: 1020.01 MB
-......... .......... .......... .......... ..........   5% ∆5s 435ms
.......... .......... .......... .......... ..........  10% ∆2s 612ms
.......... .......... .......... .......... ..........  15% ∆3s 407ms
.......... .......... .......... .......... ..........  20% ∆2s 9ms
.......... .......... .......... .......... ..........  25% ∆2s 10ms
.......... .......... .......... .......... ..........  30% ∆1s 423ms
.......... .......... .......... .......... ..........  35% ∆1s 609ms
.......... .......... .......... .......... ......-...  40% ∆349ms
.......... .......... .......... .......... ..........  45% ∆1s 809ms
.......... .......... .......... .......... ..........  50% ∆2s 5ms
.......... .......... .......... .......... ..........  55% ∆4s 214ms
.......... .......... .......... .......... ..........  60% ∆2s 207ms
.......... .......... .......... .......... ..........  65% ∆6s 821ms
.......... .......... .......... .......... ..........  70% ∆4s 7ms
.......... .......... .......... .......... ..........  75% ∆2s 612ms
.......... .......... .......... .......... ..........  80% ∆2s 610ms
.......... .......... .......... .......... ..........  85% ∆4s 8ms
.......... .......... .......... .......... ..........  90% ∆2s 2ms
.......... .......... .......... .......... ..........  95% ∆2s 804ms
.......... .......... .......... .......... .......... 100% ∆1s 807ms


IMPORT DONE in 1h 20m 44s 293ms. 
Imported:
  45464809 nodes
  139951301 relationships
  185317766 properties
Peak memory usage: 3.19 GB
There were bad entries which were skipped and logged into /Users/shengl/2-sys-ai/neo4j-3.5.12/import.report
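
Once the import is done, a fact like "Zhuge Liang's gender is male" can be looked up with a Cypher query. The sketch below only builds the query text and its parameters in Python, since running it needs a live neo4j instance. The label ENTITY, the relationship type RELATIONSHIP and the name property come from the CSV headers used for the import; the helper `attribute_query` is a made-up name.

```python
from typing import Dict, Tuple


def attribute_query(entity: str, attribute: str) -> Tuple[str, Dict[str, str]]:
    """Return a parameterised Cypher query for one attribute of one entity."""
    query = (
        "MATCH (a:ENTITY {name: $entity})"
        "-[r:RELATIONSHIP {name: $attribute}]->(b:ENTITY) "
        "RETURN b.name AS value"
    )
    return query, {"entity": entity, "attribute": attribute}


query, params = attribute_query("诸葛亮", "性别")
print(query)
print(params)
```

The query and parameter pair can be handed to whichever Neo4j client you use. With about 45 million nodes, a lookup by name will probably be slow unless an index on the name property of ENTITY nodes is created first.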

To be improved

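Copyright notice

Title: 101 - Building the ownthink Knowledge Graph with neo4j

Author: 盛领

Published: 2020-06-21 11:47:57

Original link: http://blog.xiaoyuyu.net/post/294dd39b.html

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and the author's name when reposting.

For business cooperation or licensing inquiries, please contact me at sunsetxiao@126.com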
