zoukankan      html  css  js  c++  java
  • SQL文摘:Writing Subqueries in SQL

    原文地址:https://community.modeanalytics.com/sql/tutorial/sql-subqueries/

    Subquery basics

    Subqueries (also known as inner queries or nested queries) are a tool for performing operations in multiple steps. For example, if you wanted to take the sums of several columns, then average all of those values, you’d need to do each aggregation in a distinct step.

    Subqueries can be used in several places within a query, but it’s easiest to start with the FROM statement. Here’s an example of a basic subquery:

    SELECT sub.*
      FROM (
            SELECT *
              FROM tutorial.sf_crime_incidents_2014_01
             WHERE day_of_week = 'Friday'
           ) sub
     WHERE sub.resolution = 'NONE'
    

    Let’s break down what happens when you run the above query:

    First, the database runs the “inner query”—the part between the parentheses:

    SELECT *
      FROM tutorial.sf_crime_incidents_2014_01
     WHERE day_of_week = 'Friday'
    

    If you were to run this on its own, it would produce a result set like any other query. It might sound like a no-brainer, but it’s important: your inner query must actually run on its own, as the database will treat it as an independent query. Once the inner query runs, the outer query will run using the results from the inner query as its underlying table:

    SELECT sub.*
      FROM (
           <<results from inner query go here>>
           ) sub
     WHERE sub.resolution = 'NONE'
    

    Subqueries are required to have names, which are added after parentheses the same way you would add an alias to a normal table. In this case, we’ve used the name “sub.”

    A quick note on formatting: The important thing to remember when using subqueries is to provide some way to for the reader to easily determine which parts of the query will be executed together. Most people do this by indenting the subquery in some way. The examples in this tutorial are indented quite far—all the way to the parentheses. This isn’t practical if you nest many subqueries, so it’s fairly common to only indent two spaces or so.

    Write a query that selects all Warrant Arrests from the tutorial.sf_crime_incidents_2014_01 dataset, then wrap it in an outer query that only displays unresolved incidents.

    The above examples, as well as the practice problem don’t really require subqueries—they solve problems that could also be solved by adding multiple conditions to the WHERE clause. These next sections provide examples for which subqueries are the best or only way to solve their respective problems.

    Using subqueries to aggregate in multiple stages

    What if you wanted to figure out how many incidents get reported on each day of the week? Better yet, what if you wanted to know how many incidents happen, on average, on a Friday in December? In January? There are two steps to this process: counting the number of incidents each day (inner query), then determining the monthly average (outer query):

    SELECT LEFT(sub.date, 2) AS cleaned_month,
           sub.day_of_week,
           AVG(sub.incidents) AS average_incidents
      FROM (
            SELECT day_of_week,
                   date,
                   COUNT(incidnt_num) AS incidents
              FROM tutorial.sf_crime_incidents_2014_01
             GROUP BY 1,2
           ) sub
     GROUP BY 1,2
     ORDER BY 1,2
    

    If you’re having trouble figuring out what’s happening, try running the inner query individually to get a sense of what its results look like. In general, it’s easiest to write inner queries first and revise them until the results make sense to you, then to move on to the outer query.

    Write a query that displays the average number of monthly incidents for each category. Hint: use tutorial.sf_crime_incidents_cleandate to make your life a little easier.

    Subqueries in conditional logic

    You can use subqueries in conditional logic (in conjunction with WHEREJOIN/ON, or CASE). The following query returns all of the entries from the earliest date in the dataset (theoretically—the poor formatting of the date column actually makes it return the value that sorts first alphabetically):

    SELECT *
      FROM tutorial.sf_crime_incidents_2014_01
     WHERE Date = (SELECT MIN(date)
                     FROM tutorial.sf_crime_incidents_2014_01
                  )
    

    The above query works because the result of the subquery is only one cell. Most conditional logic will work with subqueries containing one-cell results. However, INis the only type of conditional logic that will work when the inner query contains multiple results:

    SELECT *
      FROM tutorial.sf_crime_incidents_2014_01
     WHERE Date IN (SELECT date
                     FROM tutorial.sf_crime_incidents_2014_01
                    ORDER BY date
                    LIMIT 5
                  )
    

    Note that you should not include an alias when you write a subquery in a conditional statement. This is because the subquery is treated as an individual value (or set of values in the IN case) rather than as a table.

    Joining subqueries

    You may remember that you can filter queries in joins. It’s fairly common to join a subquery that hits the same table as the outer query rather than filtering in the WHEREclause. The following query produces the same results as the previous example:

    SELECT *
      FROM tutorial.sf_crime_incidents_2014_01 incidents
      JOIN ( SELECT date
               FROM tutorial.sf_crime_incidents_2014_01
              ORDER BY date
              LIMIT 5
           ) sub
        ON incidents.date = sub.date
    

    This can be particularly useful when combined with aggregations. When you join, the requirements for your subquery output aren’t as stringent as when you use the WHEREclause. For example, your inner query can output multiple results. The following query ranks all of the results according to how many incidents were reported in a given day. It does this by aggregating the total number of incidents each day in the inner query, then using those values to sort the outer query:

    SELECT incidents.*,
           sub.incidents AS incidents_that_day
      FROM tutorial.sf_crime_incidents_2014_01 incidents
      JOIN ( SELECT date,
              COUNT(incidnt_num) AS incidents
               FROM tutorial.sf_crime_incidents_2014_01
              GROUP BY 1
           ) sub
        ON incidents.date = sub.date
     ORDER BY sub.incidents DESC, time
    
    Write a query that displays all rows from the three categories with the fewest incidents reported.

    Subqueries can be very helpful in improving the performance of your queries. Let’s revisit the Crunchbase Data briefly. Imagine you’d like to aggregate all of the companies receiving investment and companies acquired each month. You could do that without subqueries if you wanted to, but don’t actually run this as it will take minutes to return:

    SELECT COALESCE(acquisitions.acquired_month, investments.funded_month) AS month,
           COUNT(DISTINCT acquisitions.company_permalink) AS companies_acquired,
           COUNT(DISTINCT investments.company_permalink) AS investments
      FROM tutorial.crunchbase_acquisitions acquisitions
      FULL JOIN tutorial.crunchbase_investments investments
        ON acquisitions.acquired_month = investments.funded_month
     GROUP BY 1
    

    Note that in order to do this properly, you must join on date fields, which causes a massive “data explosion.” Basically, what happens is that you’re joining every row in a given month from one table onto every month in a given row on the other table, so the number of rows returned is incredibly great. Because of this multiplicative effect, you must use COUNT(DISTINCT) instead of COUNT to get accurate counts. You can see this below:

    The following query shows 7,414 rows:

    SELECT COUNT(*) FROM tutorial.crunchbase_acquisitions
    

    The following query shows 83,893 rows:

    SELECT COUNT(*) FROM tutorial.crunchbase_investments
    

    The following query shows 6,237,396 rows:

        SELECT COUNT(*)
          FROM tutorial.crunchbase_acquisitions acquisitions
          FULL JOIN tutorial.crunchbase_investments investments
            ON acquisitions.acquired_month = investments.funded_month
    

    If you’d like to understand this a little better, you can do some extra research on cartesian products. It’s also worth noting that the FULL JOIN and COUNT above actually runs pretty fast—it’s the COUNT(DISTINCT) that takes forever. More on that in the lesson on optimizing queries.

    Of course, you could solve this much more efficiently by aggregating the two tables separately, then joining them together so that the counts are performed across far smaller datasets:

    SELECT COALESCE(acquisitions.month, investments.month) AS month,
           acquisitions.companies_acquired,
           investments.companies_rec_investment
      FROM (
            SELECT acquired_month AS month,
                   COUNT(DISTINCT company_permalink) AS companies_acquired
              FROM tutorial.crunchbase_acquisitions
             GROUP BY 1
           ) acquisitions
    
      FULL JOIN (
            SELECT funded_month AS month,
                   COUNT(DISTINCT company_permalink) AS companies_rec_investment
              FROM tutorial.crunchbase_investments
             GROUP BY 1
           )investments
    
        ON acquisitions.month = investments.month
     ORDER BY 1 DESC
    

    Note: We used a FULL JOIN above just in case one table had observations in a month that the other table didn’t. We also used COALESCE to display months when the acquisitions subquery didn’t have month entries (presumably no acquisitions occurred in those months). We strongly encourage you to re-run the query without some of these elements to better understand how they work. You can also run each of the subqueries independently to get a better understanding of them as well.

    Write a query that counts the number of companies founded and acquired by quarter starting in Q1 2012. Create the aggregations in two separate queries, then join them.

    Subqueries and UNIONs

    For this next section, we will borrow directly from the lesson on UNIONs—again using the Crunchbase data:

    SELECT *
      FROM tutorial.crunchbase_investments_part1
    
     UNION ALL
    
     SELECT *
       FROM tutorial.crunchbase_investments_part2
    

    It’s certainly not uncommon for a dataset to come split into several parts, especially if the data passed through Excel at any point (Excel can only handle ~1M rows per spreadsheet). The two tables used above can be thought of as different parts of the same dataset—what you’d almost certainly like to do is perform operations on the entire combined dataset rather than on the individual parts. You can do this by using a subquery:

    SELECT COUNT(*) AS total_rows
      FROM (
            SELECT *
              FROM tutorial.crunchbase_investments_part1
    
             UNION ALL
    
            SELECT *
              FROM tutorial.crunchbase_investments_part2
           ) sub
  • 相关阅读:
    C#编程规范
    实用JavaScript代码库
    JSON入门
    asp.net基本事件
    iOS开发笔记 3、iOS基础
    iOS开发笔记 8、真机调试和发布软件
    数据同步框架MS Sync Framework不同场景使用例子和简要分析
    使用AppFabric 承载WCF和WF服务实例分析和说明
    .net程序员的iPhone开发MonoTouch
    iOS开发笔记 4、iOS中的Cocoa、设计模式等
  • 原文地址:https://www.cnblogs.com/chickenwrap/p/10023965.html
Copyright © 2011-2022 走看看