  • Web Scraping and Python: (4) Advanced Crawler Extensions with Pandas, 6. Working with JSON

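    JSON (JavaScript Object Notation) is a text format for storing and exchanging data, similar in purpose to XML.

    Pandas makes it easy to work with JSON data.

    Reading JSON data

    Suppose the file sites.json contains the following: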

    [
       {
       "id": "A001",
       "name": "菜鸟教程",
       "url": "www.runoob.com",
       "likes": 61
       },
       {
       "id": "A002",
       "name": "Google",
       "url": "www.google.com",
       "likes": 124
       },
       {
       "id": "A003",
       "name": "淘宝",
       "url": "www.taobao.com",
       "likes": 45
       }
    ]

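    A simple example of reading the JSON file:

    import pandas as pd

    # read_json() parses the JSON array of objects into a DataFrame
    df = pd.read_json('sites.json')
    # to_string() renders the whole DataFrame as text, without truncating rows or columns
    print(df.to_string())

    The example above prints: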

         id    name             url  likes
    0  A001    菜鸟教程  www.runoob.com     61
    1  A002  Google  www.google.com    124
    2  A003      淘宝  www.taobao.com     45
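
    JSON text that is already in memory, for example the body of an HTTP response, can be parsed the same way. The following is a minimal sketch; the sample text and variable names are illustrative and not part of the original article. Wrapping the text in io.StringIO gives read_json() a file-like object, which avoids the deprecation warning that recent pandas versions emit for literal JSON strings.

```python
from io import StringIO

import pandas as pd

# Illustrative JSON text with the same structure as sites.json
json_text = """
[
  {"id": "A001", "name": "菜鸟教程", "url": "www.runoob.com", "likes": 61},
  {"id": "A002", "name": "Google", "url": "www.google.com", "likes": 124},
  {"id": "A003", "name": "淘宝", "url": "www.taobao.com", "likes": 45}
]
"""

# StringIO wraps the string in a file-like object that read_json() accepts
df = pd.read_json(StringIO(json_text))
print(df.to_string())
```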

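    Converting a Python dict to a DataFrame

    A JSON object has the same shape as a Python dict, so a dict can be passed directly to the DataFrame constructor:

    import pandas as pd

    # A dict laid out like a JSON object: outer keys become columns, inner keys become row labels
    s = {
        "col1": {"row1": 1, "row2": 2, "row3": 3},
        "col2": {"row1": "x", "row2": "y", "row3": "z"}
    }
    # Build the DataFrame from the dict
    df = pd.DataFrame(s)
    print(df)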
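
    The conversion also works in the other direction. As a small sketch that reuses the dict from the example above, to_json() serializes a DataFrame back to JSON text, and its orient argument chooses the layout. The variable names are illustrative.

```python
import json

import pandas as pd

s = {
    "col1": {"row1": 1, "row2": 2, "row3": 3},
    "col2": {"row1": "x", "row2": "y", "row3": "z"},
}
df = pd.DataFrame(s)

# Default layout (orient="columns"): {column -> {row label -> value}}, the same shape as s
as_columns = df.to_json()

# orient="records": a list with one object per row; the row labels are dropped
as_records = df.to_json(orient="records")

print(as_columns)
print(as_records)

# Parsing the default layout with the json module gives back the original dict
round_trip = json.loads(as_columns)
```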

    Reading JSON data from a URL

    The following code reads JSON from a URL:

    import pandas as pd
    
    URL = 'https://static.runoob.com/download/sites.json'
    df = pd.read_json(URL)
    print(df)
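
    read_json() fetches the URL itself, which leaves little control over timeouts or error handling. The sketch below is one possible alternative, not the article's method: the helper name read_json_url and its default timeout are hypothetical, and it assumes the URL returns UTF-8 encoded JSON in a layout read_json() understands.

```python
from io import StringIO
from urllib.request import urlopen

import pandas as pd


def read_json_url(url: str, timeout: float = 10.0) -> pd.DataFrame:
    """Download JSON text from url and parse it into a DataFrame.

    Hypothetical helper: assumes the response body is UTF-8 encoded JSON.
    """
    # With a timeout, urlopen raises an error instead of waiting indefinitely
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return pd.read_json(StringIO(text))


# Usage (needs network access):
# df = read_json_url('https://static.runoob.com/download/sites.json')
# print(df)
```

    Network failures surface as urllib.error.URLError, so the caller can wrap the call in try/except and decide whether to retry.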

    Nested JSON data

    Suppose we have a file of nested JSON data named nested_list.json:

    {
        "school_name": "ABC primary school",
        "class": "Year 1",
        "students": [
        {
            "id": "A001",
            "name": "Tom",
            "math": 60,
            "physics": 66,
            "chemistry": 61
        },
        {
            "id": "A002",
            "name": "James",
            "math": 89,
            "physics": 76,
            "chemistry": 51
        },
        {
            "id": "A003",
            "name": "Jenny",
            "math": 79,
            "physics": 90,
            "chemistry": 78
        }]
    }

    Read the whole file with the following code:

    import pandas as pd
    
    df = pd.read_json('nested_list.json')
    print(df)

    The example above prints:

              school_name  ...                                           students
    0  ABC primary school  ...  {'id': 'A001', 'name': 'Tom', 'math': 60, 'phy...
    1  ABC primary school  ...  {'id': 'A002', 'name': 'James', 'math': 89, 'p...
    2  ABC primary school  ...  {'id': 'A003', 'name': 'Jenny', 'math': 79, 'p...

    Each cell of the students column still holds a whole dict. To unpack the nested records completely, use the json_normalize() function:

    import pandas as pd
    import json

    # Load the file into a Python dict with the json module
    with open('nested_list.json', 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Flatten the data: one row per entry in the students list
    df_nested_list = pd.json_normalize(data, record_path=['students'])
    print(df_nested_list)

    The example above prints:

         id   name  math  physics  chemistry
    0  A001    Tom    60       66         61
    1  A002  James    89       76         51
    2  A003  Jenny    79       90         78

    data = json.load(f) loads the file with Python's json module.

    json_normalize() is called with record_path set to ['students'], which expands the nested students list.

    The result does not yet include the school_name and class fields. To show them as well, pass them as metadata through the meta parameter:

    import pandas as pd
    import json

    # Load the file into a Python dict with the json module
    with open('nested_list.json', 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Flatten the data and repeat the top-level fields on every row
    df_nested_list = pd.json_normalize(
        data,
        record_path=['students'],
        meta=['school_name', 'class']
    )
    print(df_nested_list)

    The example above prints:

         id   name  math  physics  chemistry         school_name   class
    0  A001    Tom    60       66         61  ABC primary school  Year 1
    1  A002  James    89       76         51  ABC primary school  Year 1
    2  A003  Jenny    79       90         78  ABC primary school  Year 1
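
    When a record field and a metadata field share the same name, json_normalize() raises a ValueError about conflicting metadata names. A hedged sketch of one way to avoid that uses the record_prefix parameter (meta_prefix does the same for the metadata columns). The data is inlined and shortened to two students so the snippet runs without nested_list.json, and the prefix 'student_' is an arbitrary choice.

```python
import pandas as pd

# Inlined, shortened copy of nested_list.json (illustrative)
data = {
    "school_name": "ABC primary school",
    "class": "Year 1",
    "students": [
        {"id": "A001", "name": "Tom", "math": 60},
        {"id": "A002", "name": "James", "math": 89},
    ],
}

df = pd.json_normalize(
    data,
    record_path=["students"],
    meta=["school_name", "class"],
    record_prefix="student_",  # prepended to every column that comes from a student record
)
print(df.columns.tolist())
```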

    Next, let's read a more complex JSON document that nests both lists and dicts. The data file nested_mix.json looks like this:

    {
        "school_name": "local primary school",
        "class": "Year 1",
        "info": {
          "president": "John Kasich",
          "address": "ABC road, London, UK",
          "contacts": {
            "email": "admin@e.com",
            "tel": "123456789"
          }
        },
        "students": [
        {
            "id": "A001",
            "name": "Tom",
            "math": 60,
            "physics": 66,
            "chemistry": 61
        },
        {
            "id": "A002",
            "name": "James",
            "math": 89,
            "physics": 76,
            "chemistry": 51
        },
        {
            "id": "A003",
            "name": "Jenny",
            "math": 79,
            "physics": 90,
            "chemistry": 78
        }]
    }

    Converting nested_mix.json to a DataFrame:

    import pandas as pd
    import json
    
    # Load the file into a Python dict with the json module
    with open('nested_mix.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    df = pd.json_normalize(
        data,
        record_path=['students'],
        meta=[
            'class',
            ['info', 'president'],
            ['info', 'contacts', 'tel']
        ]
    )
    
    print(df)

    The example above prints:

         id   name  math  physics  chemistry   class info.president info.contacts.tel
    0  A001    Tom    60       66         61  Year 1    John Kasich         123456789
    1  A002  James    89       76         51  Year 1    John Kasich         123456789
    2  A003  Jenny    79       90         78  Year 1    John Kasich         123456789
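
    The dotted column names such as info.contacts.tel come from the default separator. As a small sketch, the sep parameter of json_normalize() changes it, which can be convenient when the column names are later used as identifiers. The data below is an inlined, shortened copy of nested_mix.json.

```python
import pandas as pd

# Inlined, shortened copy of nested_mix.json (illustrative)
data = {
    "school_name": "local primary school",
    "class": "Year 1",
    "info": {
        "president": "John Kasich",
        "contacts": {"email": "admin@e.com", "tel": "123456789"},
    },
    "students": [
        {"id": "A001", "name": "Tom"},
        {"id": "A002", "name": "James"},
    ],
}

df = pd.json_normalize(
    data,
    record_path=["students"],
    meta=["class", ["info", "president"], ["info", "contacts", "tel"]],
    sep="_",  # join nested key paths with "_" instead of the default "."
)
print(df.columns.tolist())
```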

    Reading a single field from nested data

    The example file nested_deep.json is shown below. We only want to read the nested math field:

    {
        "school_name": "local primary school",
        "class": "Year 1",
        "students": [
        {
            "id": "A001",
            "name": "Tom",
            "grade": {
                "math": 60,
                "physics": 66,
                "chemistry": 61
            }
     
        },
        {
            "id": "A002",
            "name": "James",
            "grade": {
                "math": 89,
                "physics": 76,
                "chemistry": 51
            }
           
        },
        {
            "id": "A003",
            "name": "Jenny",
            "grade": {
                "math": 79,
                "physics": 90,
                "chemistry": 78
            }
        }]
    }

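    Here we use the glom module to handle the nesting. glom lets us address a nested value with a dotted path such as grade.math.

    glom has to be installed before first use:

    pip3 install glom

    Example code:

    import pandas as pd
    from glom import glom

    # Each cell of the students column holds one student's dict
    df = pd.read_json('nested_deep.json')

    # Pull the nested grade.math value out of every student record
    data = df['students'].apply(lambda row: glom(row, 'grade.math'))
    print(data)

    The example above prints: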

    0    60
    1    89
    2    79
    Name: students, dtype: int64
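
    If installing an extra package is not desirable, a similar result can be sketched with pandas alone. This is an alternative to the glom approach, not the article's method: json_normalize() also flattens the dicts nested inside each record, so the math score ends up in a column named grade.math. The data is an inlined copy of nested_deep.json.

```python
import pandas as pd

# Inlined copy of nested_deep.json
data = {
    "school_name": "local primary school",
    "class": "Year 1",
    "students": [
        {"id": "A001", "name": "Tom",
         "grade": {"math": 60, "physics": 66, "chemistry": 61}},
        {"id": "A002", "name": "James",
         "grade": {"math": 89, "physics": 76, "chemistry": 51}},
        {"id": "A003", "name": "Jenny",
         "grade": {"math": 79, "physics": 90, "chemistry": 78}},
    ],
}

# Each student becomes a row; the nested grade dict becomes grade.* columns
df = pd.json_normalize(data, record_path=["students"])
math_scores = df["grade.math"]
print(math_scores)
```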

  • Original article: https://www.cnblogs.com/luyj00436/p/15472484.html