Writing SQL using Copilot/ChatGPT

If you have been paying attention, you are likely aware that generative AI tools, like ChatGPT, can write code. Since SQL is a kind of code, it follows that AI tools should be able to write SQL queries for you. This is not cheating. The critical skills in data analysis are (a) knowing which questions to ask and (b) interpreting the answers lurking in the data. Mastering arcane syntax should not be thought of as the critical skill. Indeed, the original intention of SQL was to replace other, lower-level languages that require even more arcane syntax and drudgery. SQL is a declarative language, in which you tell the computer what you want rather than a procedural language (like C or COBOL or Python), in which you have to tell the computer exactly how to get what you want. In this sense, chatbot query construction is simply a refinement of SQL's original declarative vision.

Or, to put it another way, using a chatbot to generate a query is similar to our use of graphical query languages in this module. Our goal is the same: make data analysis easier and more accessible to the non-technical people who can interpret the business meaning of the results. Whether a chatbot interface is easier than a graphical interface is a matter of personal preference and is something I encourage you to explore.

Background: What is an AI chatbot

This analytics course is mostly about structured, transactional data—the kind of information you might find in a banking system or an inventory management database. However, there are other branches of analytics/artificial intelligence/machine learning focused on unstructured data, such as images, social media text streams, and so on. The underlying tools and technologies for analyzing unstructured data are generally different from what we are seeing in this course, but the underlying objectives are usually similar: prediction and classification. An important sub-branch of unstructured data analysis is natural language processing (NLP). We have all seen NLP at work on our phones, helping us to finish our sentences or correcting typos. However, NLP can do more, including generating entirely novel content.

NLP is not part of the course—at least not yet—but you still might be interested in learning more. Chatbots are an especially hot topic. Chatbots are NLP tools that generate new text in response to your prompts and thus create the illusion of a conversation with another (very knowledgeable) entity. The Internet has many NLP learning resources available, although some are very technical. Here is an overview of chatbots that provides a good balance of high-level intuition with details about how they work. Again, this is not part of the course, but it is important.

Accessing a chatbot

The best way to learn about chatbots is to play with one. Although ChatGTP is currently the most popular chatbot, there are many others and the industry is changing very quickly (Perplexity.ai comes to mind). Plus, there are public-facing tools based on the same technologies. For example, you can try ChatGPT by visiting OpenAI's site, but using the chatbot requires a (free) account. Microsoft's Copilot (formerly Bing) chatbot is also GPT-based and, although it too requires an account (at least at the time of this writing), many of us already have a Microsoft account. Alternatively, you can access Copilot through your SFU Microsoft 360 account (see here for details). we will take advantage of this arrangement for this tutorial.

Access Copilot at https://copilot.microsoft.com/. Microsoft recommends using its Edge browser for this because its tools may or may not play nicely with Google Chrome, Safari, Firefox, et al.
Log in to your Microsoft account. This is what the interface looks like at the time of writing:

Enter your SFU email address (or your personal Microsoft account if cannot/do not want to use your SFU account).
If you have multiple Microsoft accounts, you should choose "Work or school account" to use SFU's access. This may take you to SFU's login page if you are not currently logged in.

Getting the chatbot to write your query

The screenshots below are from Bing, which is basically the same thing as Copilot. The interfaces change so quickly that it is not worth updating the screenshots.

The key to using chatbots, whether you are looking up trivia, composing haiku, or writing SQL code, is a good prompt. The implication for coding is that you have to know what you want before you ask for it. Chatbot coding replaces typing and fiddling with syntax, it does not replace thinking.

Let's see how this works:

A basic query with no database schema information

Once you are logged in, type the following into the prompt area ("Ask me anything..."):
sql show customers with sales in 2022 greater than 1 million
(I have a typo in my query in the screen shot—"greate", but Copilot ignores the error)

Your results will almost certainly be different from what is shown above.

The differences between what you get and what is shown above illustrates the first important thing about chatbot output: it is non-deterministic. If you type the exact same thing into the prompt on two different occasions, you might get different results. The second thing you should notice is that chatbots are often confidently wrong. This query looks good, impressive even. But it will not run against our data without a lot of debugging, which defeats the purpose of using the chatbot. Let's take a closer look at the response to get a better sense of the issues:

                    
SELECT customer_name
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE YEAR(order_date) = 2022
    GROUP BY customer_id
HAVING SUM(order_amount) > 1000000);

You might not understand all parts of this query. For example, it contains a subquery (two SELECT statements). But we can nonetheless identify several assumptions made by the chatbot that prevent the query from running on our database:

Obviously, it is guessing about column names like customer_name. Its guesses are pretty reasonable, but "close enough" is not good enough for SQL: the field and table names must be exactly as they are in the source database. The database engine that processes SQL statements does not know that customer_name is similar to CustomerName.
The table names are also different from those in our database.
The chatbot assumes that there is a column called order_amount in the orders table. There certainly could be, but this would be poor database design because it would involve storing the same information twice. The total amount of an order can be calculated from data that is already in our OrderDetails table, which Copilot does not know about.

The lesson we learn from this confidently-wrong result is that we have to be vigilant when using chatbots to write code. This wrong query may have some value if we consider it as a mere template and are willing to edit heavily. But checking and debugging are inevitable. We can reduce our debugging efforts if we pass the chatbot additional information.

A better query with a more specific prompt

Our next prompt passes the chatbot more information about the database structure. This is more work, and it might seem like it defeats the purpose of using a chatbot. However, recall that the most important advantage of using a chatbot for SQL is avoiding having to remember SQL syntax. Understanding the structure of the database is still your responsibility.

Type a more explicit request into the prompt area ("Ask me anything..."):
sql show CustomerName from the Customers table with TotalSales in 2022 greater than 1 million where TotalSales is QtyOrdered * UnitPrice from the OrdersDetails table and OrderDetails is joined with Orders on OrderID and Customers is joined with Orders on CustomerID

This is clearly much better. It has used the information we provided to create the correct JOIN clauses. The chatbot has even guessed that we have a column called OrderDate (with the correct case) and used the YEAR() function to extract the year. However, it has completely missed the aggregation (the SUM based on a GROUP BY clause). We can fix this by telling the chatbot that TotalSales is the sum of QtyOrdered * UnitPrice and we often get the correct result.

Before we marvel too much at this, we should acknowledge that the target of all these queries is a well-known Microsoft sample database. ChatGTP has been trained on text from the Internet, which is full of documents—like this one—that use this particular database. The results may be far less accurate for other database schemas that the language model has not been trained on so extensively. Just something to keep in mind.

A dialog with the chatbot: remembering information within a session

One of the most important capabilities of current chatbots is to "remember" the context of each prompt. This creates the illusion of a conversation. We can (sometimes) exploit this capability to pass the chatbot the schema of our database and then rely on it to use the schema information when generating subsequent queries.

Hit the "New topic" button so Copilot forgets what we have done so far (clears the context).

Type the following into the prompt area ("Ask me anything..."):


            database schema:
            Customers(CustomerID, CompanyName, Country)
            Orders(OrderID, CustomerID, OrderDate)
            OrderDetails(OrderID, ProductID, UnitPrice, Quantity, Discount)
            Products(ProductID, ProductName, CategoryID, MSRP)

Run the prompt. This is just information for the chatbot—you are asking it to remember something. It will give you back a plain-language description of the schema.

There is nothing magic or formal about the sytax of the prompt above. I just assumed that Copilot knows what a "database schema" is and that it knows a colon might imply a definition. We can now expect the chatbot to remember the schema when generating new SQL:

Ask for a query:
sql show the customer name for customers with total sales in 2022 greater than 1 million

This may or may not work for you. Or it may work sometimes and not others. This is one problem with chatbots and neural network-based technologies generally: they are black boxes. All their "knowledge" is encoded in the weights between a huge number of network nodes. We have no idea what is going on in a large model and may be constantly surprised.

A work-around if you do not get a good result is to include the schema at the start of each query request:


                database schema:
                Customers(CustomerID, CompanyName, Country)
                Orders(OrderID, CustomerID, OrderDate)
                OrderDetails(OrderID, ProductID, UnitPrice, Quantity, Discount)
                Products(ProductID, ProductName, CategoryID, MSRP)
                
                sql show the customer name for customers with total sales in 2022 greater than 1 million

This query gives us what we want. It even picked up the discount (though, as noted above, it has seen this database many times before). If we need to generate additional queries, we can preface our subsequent prompts with the same schema information. In my experiments, the context across prompts (i.e., within a chat session) is hit-and-miss whereas including large amounts of information within a single prompt almost always works.

Conclusion

Large language models pretrained on huge amounts of text from the Internet have, since mid-2022, achieved a level of capability that few would have predicted even a few years ago. The field will evolve very quickly and almost certainly transform the ways in which analytics is done. For example, one might conclude that SQL's days as a data access language are numbered. Chatbots provide a much more forgiving declarative interface. However, at the same time, we can see the limitations of the natural language interface and the advantages of formal, precise languages like SQL and its graphical equivalents. This is especially true for more complex queries.