Bulk Operations

Sometimes you need to Extract/Transform/Load data en masse. In these scenarios, you will be doing bulk operations for performance. There is a Bulk Copy Program called bcp that can import and export data from/to a file. It is a piece of SQL Server. Instead it is installed along of it. You call it from the command prompt.

Within SQL Server, you can insert data from a file using BULK INSERT. There are lot of options you can specify, such as the size of the batch for transactions. You can also specify the name of a file to log errors to. The input file can be in CSV format. It can also be in some other custom format defined by a format file you specify.

Alternatively, you can use the OPENROWSET() function to bulk read data from a file. You must specify the data file input parameter as BULK. This bulk read can be done over a network. Under the hood, this utilizes OLE DB technology to get the data. Errors can be logged to a file you specify. The format of the data file can be CSV, or a custom format you define in another file.

XML Overview and Data Type

XML is an expandable/extendable syntax. You need to meet certain rules to qualify as being well formed XML. Elements are placed between names in < >. You can nest elements within each other. There is one root element in a well formed XML document. Elements can have attributes.

Variables can be of type XML. You then assign XML that is in string format to those variables. You can retrieve data from that variable. By default, it comes back in 1 row/column. This data can also be stored in the column if that column is of type XML.

There are multiple ways to retrieve data from a regular table into XML output format. The easiest is to SELECT the data FOR XML RAW. That by default produces elements that are named "row". The data in each element is a row from the table you are selecting from. There are options to change the name of the element, make the source columns be output elements, and so on.

Another way to retrieve data is to SELECT the data FOR XML AUTO. The first table selected will generate the outer elements. If you SELECT another table in the query, those records become nested elements in the output.

If you don't like elements and sub-elements, you can use FOR XML PATH. This gives you control over which source data becomes elements, and which becomes attributes in the output.

There is one more way to retrieve data into XML output. That is FOR XML EXPLICIT. This requires specific items being selected to control the output such as the root element tag name. It is not a common technique for grabbing XML data.

Unique Identifiers

There are multiple ways to generate an identifier to unique refer to a row in a table. You could define a column as a numeric type with the IDENTITY tag. The drawback is that this works for newly inserted rows, but not date already in the table. You also cannot choose your own value for that column. The system determines it.

You can also employ globally unique identifiers (GUIDs). These are columns with data type UNIQUEIDENTIFIER. This produces a 128 bit integer. The NEWID() function gives you a random value every time you call it. Put that value in the GUID column. When you look at it, the value will be show in hexadecimal. If you want to use GUIDs, and are able to let the system generate the values automatically like IDENTITY, you can use NEWSEQUENTIALID() instead of NEWID() in the column's default constraint. This leads to better performance.

If you want more control over the numbers generated, you can use sequences. The sequence is an object by itself not associated with any specific table or column. You define properties of the sequence at creation time such as the starting number, how much the value increments each time you get a new value, max/min values, and whether the numbers cycle back to the beginning when the max is reached. Use the NEXT VALUE FOR the sequence to get your generated ID.

SQL Server Odds and Ends

Metadata for the system is stored in views that reside in the sys schema. Some examples of these are sys.objects, sys.tables, sys.queries, and sys.views.

Synonyms are objects that refer to other objects. For example, they can point to a table. The underlying table does not have to exist at synonym creation time. The syntax is CREATE SYNONYM and then the synonym name, and then FOR plus the name of the object you are referring to.

Dynamic SQL is SQL that is stored in a string and executed from it. You run the SQL using the EXEC command, passing the SQL in parentheses. You can build up the string in code. Just got to be careful if the string is built from inputs that might be unexpected. Worst case, a malicious actor could inject some nasty SQL into your string which you execute.

Functions Overview

There are two broad categories of functions in SQL Server:
  1. Scalar functions
  2. Tabular functions
Scalar functions are those that return a single value. When you define them, you declare what type the function will returns. You surround the body with BEGIN and END. You use the RETURN statement to pass back that single value to the caller.

Tabular functions are further broken down into two subsets: inline table functions and multi-statement table functions. Inline table functions are simple. You don't need a BEGIN and END. You just declare them are RETURNS TABLE. Then you immediately state it AS RETURN plus a SQL query in parentheses. The result set is passed back to the caller.

Multi-statement table functions are more verbose to code. You state the name of a variable as the return type. You then declare that as a TABLE and provide the table definition in parentheses. Finally you have a BEGIN and END. Inside you will insert into that variable you previously declare as the return type.

Common Table Expressions

Let's talk about Common Table Expressions (CTEs). A table expressions is a named query. A CTE is a query you put at the front of the statement that you can later refer back to in the rest of the SQL. You start using the keyword WITH, and then enclose your CTE SQL in parentheses. You follow that with the keyword AS, and provide an alias.

If you have multiple CTEs, you separate them with commas. Each CTE needs its own unique alias. Then in the rest of your SQL statement, you can refer back to the CTE via the alias. The results of the CTE are temporarily stored as a derived table for the duration of the query. You cannot refer to the CTE outside of this query.

CTEs help with readability. They are also helpful if there is some data that you need to grab outside of the main query where you are assembling the results. A good example of where you can benefit from a CTE is as part of a PIVOT.

There is a special kind of CTE called a recursive CTE. You need two parts for this. The first is an anchor which is stand alone. The second is one that will repeats for each level of recursion. You UNION these two pieces together. The recursion continues until there are 0 rows left.

Derived Tables


I have previously described how to use subqueries to connect two tables. You can also accomplish the same thing with derived tables. These are where you issue a query, surround it with parentheses, and assign an alias to the result. Then the results behave just like a table that can be joined with an outer table.
The reason to use a derived table is that it might, in some circumstances, be faster than using a join. The simplest of derived tables have a stand alone query you can independently run. But that is not a requirement. The SQL for the derived table can reference columns in the outer query. This is called a correlated subquery, and can provide a way to tie results between the two tables.