  • Python for Data Science

    Chapter 6 - Data Sourcing via Web

    Segment 4 - Web scraping

    from bs4 import BeautifulSoup
    import urllib.request
    from IPython.display import HTML
    import re
    
    # Download the page and parse the raw HTML with the lxml parser.
    r = urllib.request.urlopen('https://analytics.usa.gov/').read()
    soup = BeautifulSoup(r, "lxml")
    type(soup)
    
    bs4.BeautifulSoup
    
    print(soup.prettify()[:100])
    
    <!DOCTYPE html>
    <html lang="en">
     <!-- Initalize title and data source variables -->
     <head>
      <!--
    
    # Print the href attribute of every <a> tag on the page.
    for link in soup.find_all('a'):
        print(link.get('href'))
    
    /
    #explanation
    https://analytics.usa.gov/data/
    https://open.gsa.gov/api/dap/
    data/
    #top-pages-realtime
    #top-pages-7-days
    #top-pages-30-days
    https://analytics.usa.gov/data/live/all-pages-realtime.csv
    https://analytics.usa.gov/data/live/all-domains-30-days.csv
    https://www.digitalgov.gov/services/dap/
    https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4
    https://support.google.com/analytics/answer/2763052?hl=en
    https://analytics.usa.gov/data/live/second-level-domains.csv
    https://analytics.usa.gov/data/live/sites.csv
    mailto:DAP@support.digitalgov.gov
    https://analytics.usa.gov/data/
    https://open.gsa.gov/api/dap/
    mailto:DAP@support.digitalgov.gov
    https://github.com/GSA/analytics.usa.gov/issues
    https://github.com/GSA/analytics.usa.gov
    https://github.com/18F/analytics-reporter
    http://www.gsa.gov/
    https://www.digitalgov.gov/services/dap/
    https://cloud.gov/
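
The list above mixes absolute URLs with relative ones such as `/`, `#explanation` and `data/`. As a minimal sketch (not part of the original notebook), the standard library's `urllib.parse.urljoin` can resolve them against the page address. The `hrefs` list is a hand-copied sample of the output above, and `BASE_URL` and `resolved` are illustrative names.

```python
from urllib.parse import urljoin

# Address of the page scraped above (used as the base for relative links).
BASE_URL = "https://analytics.usa.gov/"

# A few href values copied by hand from the output above.
hrefs = [
    "/",
    "#explanation",
    "data/",
    "https://open.gsa.gov/api/dap/",
    "mailto:DAP@support.digitalgov.gov",
]

# urljoin leaves absolute URLs (and other schemes such as mailto:) untouched
# and resolves relative paths and fragments against the base.
resolved = [urljoin(BASE_URL, href) for href in hrefs]

for href, full in zip(hrefs, resolved):
    print(href, "->", full)
```

In the notebook the same call could be applied to `link.get('href')` inside the loop, skipping links whose `href` is `None`.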
    
    print(soup.get_text())
    














    analytics.usa.gov | The US government's web traffic.
    





















    analytics.usa.gov
    


    About this site
    Data | API
    


    Select an agency
    
    All Participating Websites
    Agency for International Development
    Department of Agriculture
    Department of Commerce
    Department of Defense
    Department of Education
    Department of Energy
    Department of Health and Human Services
    Department of Homeland Security
    Department of Housing and Urban Development
    Department of Justice
    Department of Labor
    Department of State
    Department of Transportation
    Department of Veterans Affairs
    Department of the Interior
    Department of the Treasury
    Environmental Protection Agency
    Executive Office of the President
    General Services Administration
    National Aeronautics and Space Administration
    National Archives and Records Administration
    National Science Foundation
    Nuclear Regulatory Commission
    Office of Personnel Management
    Postal Service
    Small Business Administration
    Social Security Administration
    








    ...
    people on government websites now
    

    Visits Today
    Eastern Time
    





    Visits in the Past 90 Days
    

              There were ... visits over the past 90 days.
    

    Devices
    




                Based on rough network segmentation data, we estimate that less than 5% of all traffic across all agencies comes from US federal government networks.
    

                Much more detailed data is available in downloadable CSV and JSON. This includes data on combined browser and OS usage.
    


    Browsers
    




    Internet Explorer
    




    Operating Systems
    




    Windows
    






    Visitor Locations Right Now
    

    Cities
    





    Countries
    




    United States & Territories
    



    International
    







    Top Pages
    
    Now
    7 Days
    30 Days
    


                  People on a single, specific page now. We only count pages with at least 10 people on the page.
                  Download the full dataset.
    




    Visits over the last week to domains, including traffic to all pages within that domain.
    




                  Visits over the last month to domains, including traffic to all pages within that domain. We only count pages with at least 1,000 visits in the last month.
                  Download the full dataset.
    





    Top Downloads
    Total file downloads yesterday on government domains.
    







    About this Site
    
                These data provide a window into how people are interacting with the government online.
                 The data come from a unified Google Analytics account for U.S. federal government agencies known as the Digital Analytics Program.
                  This program helps government agencies understand how people find, access, and use government services online. The program does not track individuals,
                   and anonymizes the IP addresses of visitors.
    

                Not every government website is represented in these data. 
                Currently, the Digital Analytics Program collects web traffic from around 400 executive branch government domains,
                 across about 5,700 total websites,
                  including every cabinet department.
                   We continue to pursue and add more sites frequently; to add your site, email the Digital Analytics Program.
    


    Download the data
    You can download the data here. Available in JSON and CSV format.
     Additionally, you can access data via our  API project (currently in Beta).
    A note on sampling
    Due to varying Google Analytics API sampling thresholds and the sheer volume of data in this project, some non-realtime reports may be subject to sampling. 
                 The data are intended to represent trends and numbers may not be precise.
    





    Have a question or problem? 
                  
                  Get in touch.
    


                      Suggest a feature or report an issue
    




                  View our code on GitHub
    

                  View our code for the data on GitHub
    









    Analytics.usa.gov is a project of GSA’s Digital Analytics Program.
    This website is hosted on cloud.gov.
    











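
The text returned by `get_text()` keeps the page's whitespace, which is why the output above contains long runs of blank lines. One possible cleanup, sketched here with plain string methods, drops empty lines and strips indentation. `raw_text` is a small hand-made sample shaped like the output above, and `collapse_blank_lines` is a hypothetical helper name.

```python
# Hand-made sample imitating the get_text() output above:
# content lines separated by blank lines and stray indentation.
raw_text = (
    "\n\n\nanalytics.usa.gov\n\n\n"
    "About this site\nData | API\n\n\n"
    "   Select an agency\n\n"
)


def collapse_blank_lines(text):
    """Return the non-empty lines of text with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


clean_lines = collapse_blank_lines(raw_text)
print("\n".join(clean_lines))
```

In the notebook, `collapse_blank_lines(soup.get_text())` would apply the same cleanup to the real page text.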
    print(soup.prettify()[0:1000])
    
    <!DOCTYPE html>
    <html lang="en">
     <!-- Initalize title and data source variables -->
     <head>
      <!--
    
        Hi! Welcome to our source code.
    
        This dashboard uses data from the Digital Analytics Program, a US
        government team inside the General Services Administration.
    

        For a detailed tech breakdown of how 18F and friends built this site:
    
        https://18f.gsa.gov/2015/03/19/how-we-built-analytics-usa-gov/
    

        This is a fully open source project, and your contributions are welcome.
    
        Frontend static site: https://github.com/18F/analytics.usa.gov
        Backend data reporting: https://github.com/18F/analytics-reporter
    
        -->
      <meta charset="utf-8"/>
      <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
      <meta content="NjbZn6hQe7OwV-nTsa6nLmtrOUcSGPRyFjxm5zkmCcg" name="google-site-verification"/>
      <link href="/css/vendor/css/uswds.v0.9.6.css" rel="stylesheet"/>
      <link href="/css/public_analytics.css" rel="stylesheet"/>
      <link href="/images/analytics-favicon.ico" rel="ic
    
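    # Keep only <a> tags whose href starts with "http" (absolute links).
    # find_all is the current name; findAll is the older alias.
    for link in soup.find_all('a', attrs={'href': re.compile("^http")}):
        print(link)
    type(link)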
    
    <a href="https://analytics.usa.gov/data/">Data</a>
    <a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank">API</a>
    <a href="https://analytics.usa.gov/data/live/all-pages-realtime.csv">Download the full dataset.</a>
    <a href="https://analytics.usa.gov/data/live/all-domains-30-days.csv">Download the full dataset.</a>
    <a class="external-link" href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
    <a class="external-link" href="https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4">does not track individuals</a>
    <a class="external-link" href="https://support.google.com/analytics/answer/2763052?hl=en">anonymizes the IP addresses</a>
    <a class="external-link" href="https://analytics.usa.gov/data/live/second-level-domains.csv">400 executive branch government domains</a>
    <a class="external-link" href="https://analytics.usa.gov/data/live/sites.csv">about 5,700 total websites</a>
    <a href="https://analytics.usa.gov/data/">download the data here.</a>
    <a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank"> API project</a>
    <a class="usa-button usa-button-secondary-inverse" href="https://github.com/GSA/analytics.usa.gov/issues">
    <img alt="Github Icon" class="github-icon" src="/images/github-logo-white.svg"/>
                      Suggest a feature or report an issue
                </a>
    <a href="https://github.com/GSA/analytics.usa.gov">
    <img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
                  View our code on GitHub</a>
    <a href="https://github.com/18F/analytics-reporter">
    <img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
                  View our code for the data on GitHub</a>
    <a href="http://www.gsa.gov/">
    <img alt="GSA" src="/images/gsa-logo.svg"/>
    </a>
    <a href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
    <a href="https://cloud.gov/">cloud.gov</a>
    
    
    
    
    
    bs4.element.Tag
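
To see why the pattern `^http` keeps only the absolute links, the short sketch below runs the same regular expression on a few `href` values copied by hand from the earlier output. It uses only the `re` module. Beautiful Soup applies a compiled pattern through its `search()` method, so the sketch calls `search()` too. The names `absolute`, `hrefs` and `matches` are illustrative.

```python
import re

# Same pattern as in the find_all call above: the value must start with "http".
absolute = re.compile("^http")

# A few href values copied by hand from the earlier output.
hrefs = [
    "/",
    "#explanation",
    "data/",
    "https://analytics.usa.gov/data/",
    "http://www.gsa.gov/",
    "mailto:DAP@support.digitalgov.gov",
]

# search() with a ^ anchor only matches at the start of the string,
# so both http:// and https:// links pass and everything else is dropped.
matches = [href for href in hrefs if absolute.search(href)]

for href in matches:
    print(href)
```

Relative paths, fragment links and `mailto:` addresses are filtered out, which matches the shorter list printed above.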
    
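    # Save each absolute link to a text file, one tag per line.
    # The with block closes the file even if an error occurs.
    with open("parsed_data.txt", "w", encoding="utf-8") as file:
        for link in soup.find_all('a', attrs={'href': re.compile("^http")}):
            soup_link = str(link)
            print(soup_link)
            file.write(soup_link + "\n")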
    
    <a href="https://analytics.usa.gov/data/">Data</a>
    <a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank">API</a>
    <a href="https://analytics.usa.gov/data/live/all-pages-realtime.csv">Download the full dataset.</a>
    <a href="https://analytics.usa.gov/data/live/all-domains-30-days.csv">Download the full dataset.</a>
    <a class="external-link" href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
    <a class="external-link" href="https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4">does not track individuals</a>
    <a class="external-link" href="https://support.google.com/analytics/answer/2763052?hl=en">anonymizes the IP addresses</a>
    <a class="external-link" href="https://analytics.usa.gov/data/live/second-level-domains.csv">400 executive branch government domains</a>
    <a class="external-link" href="https://analytics.usa.gov/data/live/sites.csv">about 5,700 total websites</a>
    <a href="https://analytics.usa.gov/data/">download the data here.</a>
    <a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank"> API project</a>
    <a class="usa-button usa-button-secondary-inverse" href="https://github.com/GSA/analytics.usa.gov/issues">
    <img alt="Github Icon" class="github-icon" src="/images/github-logo-white.svg"/>
                      Suggest a feature or report an issue
                </a>
    <a href="https://github.com/GSA/analytics.usa.gov">
    <img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
                  View our code on GitHub</a>
    <a href="https://github.com/18F/analytics-reporter">
    <img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
                  View our code for the data on GitHub</a>
    <a href="http://www.gsa.gov/">
    <img alt="GSA" src="/images/gsa-logo.svg"/>
    </a>
    <a href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
    <a href="https://cloud.gov/">cloud.gov</a>
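
A quick way to check a file written like this is to read it back and compare it with what was written. The sketch below is an illustration under stated assumptions: it writes two sample tags (copied from the output above) to a scratch file in a temporary directory, with one tag per line, so it does not touch `parsed_data.txt`. The file name `parsed_data_check.txt` is hypothetical.

```python
import os
import tempfile

# Two sample tags copied from the output above.
tags = [
    '<a href="https://analytics.usa.gov/data/">Data</a>',
    '<a href="https://cloud.gov/">cloud.gov</a>',
]

# Hypothetical scratch file, kept away from the notebook's parsed_data.txt.
path = os.path.join(tempfile.mkdtemp(), "parsed_data_check.txt")

# Write one tag per line (assumption: a newline separates the tags).
with open(path, "w", encoding="utf-8") as f:
    for tag in tags:
        f.write(tag + "\n")

# Read the file back and strip the trailing newline from each line.
with open(path, encoding="utf-8") as f:
    read_back = [line.rstrip("\n") for line in f]

print(len(read_back), "links read back")
```

If the tags were written without a separator, the whole file would come back as a single long line.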
    
    %pwd
    
    '/home/ericwei/Ex_Files_Python_Data_Science_EssT_Pt_1/Exercise Files/06_04_begin'
  • Original article: https://www.cnblogs.com/keepmoving1113/p/14286921.html