Bulk Operations

Sometimes you need to Extract/Transform/Load data en masse. In these scenarios, you will be doing bulk operations for performance. There is a Bulk Copy Program called bcp that can import and export data from/to a file. It is a piece of SQL Server. Instead it is installed along of it. You call it from the command prompt.

Within SQL Server, you can insert data from a file using BULK INSERT. There are lot of options you can specify, such as the size of the batch for transactions. You can also specify the name of a file to log errors to. The input file can be in CSV format. It can also be in some other custom format defined by a format file you specify.

Alternatively, you can use the OPENROWSET() function to bulk read data from a file. You must specify the data file input parameter as BULK. This bulk read can be done over a network. Under the hood, this utilizes OLE DB technology to get the data. Errors can be logged to a file you specify. The format of the data file can be CSV, or a custom format you define in another file.

XML Overview and Data Type

XML is an expandable/extendable syntax. You need to meet certain rules to qualify as being well formed XML. Elements are placed between names in < >. You can nest elements within each other. There is one root element in a well formed XML document. Elements can have attributes.

Variables can be of type XML. You then assign XML that is in string format to those variables. You can retrieve data from that variable. By default, it comes back in 1 row/column. This data can also be stored in the column if that column is of type XML.

There are multiple ways to retrieve data from a regular table into XML output format. The easiest is to SELECT the data FOR XML RAW. That by default produces elements that are named "row". The data in each element is a row from the table you are selecting from. There are options to change the name of the element, make the source columns be output elements, and so on.

Another way to retrieve data is to SELECT the data FOR XML AUTO. The first table selected will generate the outer elements. If you SELECT another table in the query, those records become nested elements in the output.

If you don't like elements and sub-elements, you can use FOR XML PATH. This gives you control over which source data becomes elements, and which becomes attributes in the output.

There is one more way to retrieve data into XML output. That is FOR XML EXPLICIT. This requires specific items being selected to control the output such as the root element tag name. It is not a common technique for grabbing XML data.

Unique Identifiers

There are multiple ways to generate an identifier to unique refer to a row in a table. You could define a column as a numeric type with the IDENTITY tag. The drawback is that this works for newly inserted rows, but not date already in the table. You also cannot choose your own value for that column. The system determines it.

You can also employ globally unique identifiers (GUIDs). These are columns with data type UNIQUEIDENTIFIER. This produces a 128 bit integer. The NEWID() function gives you a random value every time you call it. Put that value in the GUID column. When you look at it, the value will be show in hexadecimal. If you want to use GUIDs, and are able to let the system generate the values automatically like IDENTITY, you can use NEWSEQUENTIALID() instead of NEWID() in the column's default constraint. This leads to better performance.

If you want more control over the numbers generated, you can use sequences. The sequence is an object by itself not associated with any specific table or column. You define properties of the sequence at creation time such as the starting number, how much the value increments each time you get a new value, max/min values, and whether the numbers cycle back to the beginning when the max is reached. Use the NEXT VALUE FOR the sequence to get your generated ID.

SQL Server Odds and Ends

Metadata for the system is stored in views that reside in the sys schema. Some examples of these are sys.objects, sys.tables, sys.queries, and sys.views.

Synonyms are objects that refer to other objects. For example, they can point to a table. The underlying table does not have to exist at synonym creation time. The syntax is CREATE SYNONYM and then the synonym name, and then FOR plus the name of the object you are referring to.

Dynamic SQL is SQL that is stored in a string and executed from it. You run the SQL using the EXEC command, passing the SQL in parentheses. You can build up the string in code. Just got to be careful if the string is built from inputs that might be unexpected. Worst case, a malicious actor could inject some nasty SQL into your string which you execute.

Functions Overview

There are two broad categories of functions in SQL Server:
  1. Scalar functions
  2. Tabular functions
Scalar functions are those that return a single value. When you define them, you declare what type the function will returns. You surround the body with BEGIN and END. You use the RETURN statement to pass back that single value to the caller.

Tabular functions are further broken down into two subsets: inline table functions and multi-statement table functions. Inline table functions are simple. You don't need a BEGIN and END. You just declare them are RETURNS TABLE. Then you immediately state it AS RETURN plus a SQL query in parentheses. The result set is passed back to the caller.

Multi-statement table functions are more verbose to code. You state the name of a variable as the return type. You then declare that as a TABLE and provide the table definition in parentheses. Finally you have a BEGIN and END. Inside you will insert into that variable you previously declare as the return type.

Common Table Expressions

Let's talk about Common Table Expressions (CTEs). A table expressions is a named query. A CTE is a query you put at the front of the statement that you can later refer back to in the rest of the SQL. You start using the keyword WITH, and then enclose your CTE SQL in parentheses. You follow that with the keyword AS, and provide an alias.

If you have multiple CTEs, you separate them with commas. Each CTE needs its own unique alias. Then in the rest of your SQL statement, you can refer back to the CTE via the alias. The results of the CTE are temporarily stored as a derived table for the duration of the query. You cannot refer to the CTE outside of this query.

CTEs help with readability. They are also helpful if there is some data that you need to grab outside of the main query where you are assembling the results. A good example of where you can benefit from a CTE is as part of a PIVOT.

There is a special kind of CTE called a recursive CTE. You need two parts for this. The first is an anchor which is stand alone. The second is one that will repeats for each level of recursion. You UNION these two pieces together. The recursion continues until there are 0 rows left.

Derived Tables


I have previously described how to use subqueries to connect two tables. You can also accomplish the same thing with derived tables. These are where you issue a query, surround it with parentheses, and assign an alias to the result. Then the results behave just like a table that can be joined with an outer table.
The reason to use a derived table is that it might, in some circumstances, be faster than using a join. The simplest of derived tables have a stand alone query you can independently run. But that is not a requirement. The SQL for the derived table can reference columns in the outer query. This is called a correlated subquery, and can provide a way to tie results between the two tables.

Subqueries

Normally when you are querying data from two tables, you use a join. However, you can also accomplish this with subqueries. One subquery technique is to use the IN operator with the WHERE clause. You provide an inner query for the IN operator. Then any rows in the outer table that match any one of the values in the IN clause produce an output row.
Another syntax to accomplish the same results as the IN operator is to use ANY. You add this to the WHERE clause, using a comparison operator such as equals with ANY. You pass the ANY a subquery to execute. If you are testing whether a column equals this ANY expression, it behaves just like the IN scenario.
The benefit of using subqueries as described above is that you can manually execute the SQL from the IN/ANY. Then you can see the list of values you are trying to match with the outer table.

Geometry in Databases

There is a whole set of functions available to work with geometry in SQL Server. Geometry refers to objects with coordinates on a flat map. This is different than geography, which uses a globe and takes the curvature of the Earth into account. Geometry is fine if you are not dealing with huge distances.
Geography is more complex than geometry. You reference location via longitude and latitude.

In geometry, you have x and y coordinates. Pair them up and you get a point. The points can be used as endpoints for more complex objects. Once you have points or other objects, you can compute relations between them such as distance and intersection.

Once you have geometry objects, you can aggregate them together. That means combine lesser objects to form another one that encompasses the source objects. The result can be a short of outline that fits all the source objects.

ROLLUP, CUBE, and GROUPING SETS

When you add a GROUP BY clause to a SELECT statement, you define a grouping set. The data is grouped by those columns. The groups will become rows in the output data. You can have more than one grouping set if you group by a GROUPING SET, ROLLUP or CUBE instead of some columns.

When you specify GROUPING SET, you provide a list of groups enclosed in parentheses. These groups are identified by some columns. Each of those will operate like the normal GROUP BY does with a set of columns. Here we just have more than one group.

ROLLUP is used when the data is naturally hierarchical. The ROLLUP is a special case of grouping set. It just chooses the combinations of groups that makes sense based on the hierarchy. You will get a top level group with just the highest column in the hierarchy. Then you also get one with this highest plus the next highest, and so on. Finally there is one group that is the grand total of everything.

CUBE, like ROLLUP, is a special case of a grouping set. It produces every possible combination of groups given the columns you specify. Get ready for a whole lot of rows, and possibly a lot of NULLs in your output if you use CUBE.

There are some handy functions when dealing with grouping sets. GROUPING_ID() tells you the level of the rollup. GROUPING() returns a 1 if the specified column is in the grouping set.

Analytic Functions


In addition to aggregate and ranking functions, you can also query analytic functions on windows of data. These were added in SQL Server 2012. There were a total of eight new analytic functions. Only a few of them are very common.
The FIRST_VALUE() function gives you the value at the beginning of the partition. Similarly LAST_VALUE() give you the value at the end of the partition. LAG() goes backwards a specified number of rows in the window. And LEAD() goes forward a specified number of rows.
There are some analytic functions that work with averages. First let’s talk about what average means. It is not necessarily the median value, which is the value in the middle of an ordered group. It is instead the sum of all values in the group divided by the number of values in the group.
You can get the numerical average using the PERCENTAGE_CONT(0.5) function. Or you could get the single value in the group that is closest to that average using the PERCENTAGE_DISC(0.5) function.

Ranking Functions

You usually select aggregate functions to operate on windows of data. However you can also query ranking functions on the windows as well. These do not operate on ranges. They almost always need an ORDER BY for window definition. Examples of such ranking functions are ROW_NUMBER, NTILE, RANK, and DENSE_RANK.
ROW_NUMBER counts incrementally from 1 within the current partition. NTILE splits the partition into a specified number of buckets, and tells you what bucket the current row is in. For example, NTILE(100) will chop the partition into 100 buckets, and you get a value between 1 and 100 from the function, like a percentage.
RANK gives you a count like ROW_NUMBER, but rows with the same value in the ORDER BY column all get the same value. So ties get the same rank. However there will be a jump in rank value after such ties. DENSE_RANK acts like RANK except it does not skip any values in the RANK even when there are ties in the column values.

Window for Aggregate Functions


When you are selecting aggregate functions such as SUM(), you should define the window over which the function will be applied. To do this, you specify the OVER() clause. If you pass no further info to OVER(), the window is the whole set of input records. That is the default. But it is not too useful.
Instead you can pass a PARTITION BY inside the OVER clause to define the column or columns that form the windows. If you pass a single column, then each unique value for that column forms a window in which the aggregates function will be applied.
If you only specify the PARTITION BY, the window is all records with the same value in the column(s) you specify. However you can also add an ORDER BY clause, which causes the window to operate differently. ORDER BY and column(s) instruct the window to be all records with the same value in the specified column up to the current row. For example, if your aggregate function is SUM(), then PARTITION BY plus ORDER BY lets you for a running total within each value for the column.
You can further control the window for aggregate functions by adding a ROWS BETWEEN clause to the PARTITION BY and ORDER BY. This allows you to define the starting row and ending row for your window. These rows can be relative to the current row (e.g. 5 PRECEEDING) or absolute (e.g. UNBOUNDED PRECEEDING).

Stored Procedures


Stored procedures allow you to bundle SQL commands for easy access. They are infrequently called system procedures. You execute the procedure by its name followed by a comma separated list of arguments. Unlike procedure calls in the Oracle database, you do not surround parameters with parentheses for the call.

When you pass parameters to a stored procedure, the order of parameters should match the order in which they are defined in the stored procedure. Alternatively you can explicitly specify the parameter names when you supply the parameters.
To get rid of a stored procedure, you need to DROP it. You can look at stored procedure metadata in the “sys.procedures” view.
 

 

The MERGE


When you have information for a table, you normally insert records if the record is new. And you update records if a record already exists. The MERGE statement allows you to handle multiple scenarios in one fell swoop.

In a MERGE, the table that you will insert into or update is the target table. The table you are getting data from is the source table. You define the rules for the operation based on whether there is a match between source and target tables.
There are some limitations using MERGE. You have to have a semicolon at the end of the statement. For a given source row, you can only MERGE into a single row in the target table.
 


UNION, EXCEPT, and INTERSECT

Normally when you execute SELECT statements, you are dealing with sets of data. There are some set operations that can work on two different sources (e.g. tables). One of the most common operations is UNION. If you UNION two data sets, you will get all data from the two sets, with duplicates removed.
There are some restrictions in place when you perform a union. The data types in each column of the two sources must be comparable. If the data types are comparable but different, the bigger data type is produced in the UNION result. The names of the columns are the names from the first source used. The order of output is not guaranteed.
If you want to do a UNION, but want to preserve the duplicates in the output, you can perform a UNION ALL. This is faster than the UNION that removes the dups. If you want to order the results of the UNION, the ORDER BY must be placed after the second query.
There are other operations in addition to UNION. EXCEPT will give you results from the first query that are not in the second. INTERSECT will give you results that are only in both queries.

Database Triggers

Triggers are code that are automatically executed when certain events occur. There are two broad categories of triggers: (1) FOR/AFTER triggers, and (2) INSTEAD OF triggers.
The FOR trigger is straight forward. It is also called an AFTER trigger. When you issue an INSERT, UPDATE, or DELETE, those operations act on the data. Then the AFTER trigger is fired. That gives you a chance to rollback the changes made by the DML.
The INSTEAD OF trigger works differently. The data changes from the original DML are not made. The INTEAD OF trigger will fire. You can then do some other work. This is useful for a view, where there are limitations on what you can do with DML.
Triggers can be nested. The outer DML can cause a trigger to fire, which in turn executes other DML that itself causes another trigger to fire. Within the trigger you can access the @@NESTLEVEL system variable to check how deep you are within nested trigger calls. It has a value of 0 for the outer DML code. It is 1 if you are in the first trigger, 2 for code called by the first trigger, and so on. The maximum depth of calls is 32.

Indexed Views

Most of the time you are querying data from a view. However, you can INSERT/UDPATE/DELETE as well, if you follow certain rules. Only one base table at a time can be modified through a view.
You can add indexes to view. This can make access times much faster. You can make a view an indexed view by ensuring the first index on the view is indexed. There can be other indexes on such a view. But they will not be clustered.
When the view is an indexed view, previous scans which went through all the data will instead seek the records using the index. The indexed view might be fast, but has some restrictions:
·         Cannot do outer joins with indexed view
·         Cannot use DISTINCT with indexed view
·         Cannot do COUNT on indexed view
·         Cannot use TOP on indexed view
Cannot use UNION with indexed view

View Metadata and Source

A view, in general, is a stored SQL query that exposes itself like a table. It can restrict some users seeing some columns in the base tables. You can join multiple tables with the SQL query of a view to make the results appear like a single table.
Information about views is found in places such as INFORMATION_SCHEMA.VIEWS. You can also look at SYS.VIEWS in SQL Server. It will show you all views that have been created. To generate a view you issue a CREATE VIEW command and specify the SQL behind the view. To remove the view, you do a DROP VIEW.
The SQL source behind a view can be extracted by querying SYS.COMMENTS. You can also use Object Explorer to spy on the SQL. You can also employ the OBJECT_DEFINITION() function and pass in the object ID of the view. The source code can be hidden if you choose the WITH ENCRYPTION option when creating the view.

Column Constraints

One of the things you can add to columns when you CREATE or ALTER a table are constraints. There are many different types of constraints:
·         Not null
·         Unique
·         Default
·         Check
·         Primary key
·         Foreign key
A not null constraint is self-explanatory. It prevents that column from holding a NULL value. A unique constraint prevents duplicate values from being put into a column. You can have a NULL value in a column with a unique constraint. But there can be only one row with a NULL. Trying to add a second NULL to a column with a unique constraint causes a violation and error.
Default constraints will let you specify a value that the column gets when you insert a record without a value for that column. Normally the column would get a NULL in this scenario. But with this constraint, it gets another value you specify for the column.
The check constraint will enforce criteria you specify. This is done for new values you insert, as well as existing data at the time when you add the constraint. The primary key constraint is like the unique constraint. However, a column with a primary key constraint cannot be NULL. By default, the primary key constraint will cause the table to be clustered (put in order using values from the column).
Foreign key constraints establish references to primary keys in another table. You can only put values in a foreign key constraint column that exist in the primary key of the other table. The only exception to this is that you can put a NULL in the foreign key constraint column.

Tables and Transactions

Normal base tables are created and worked with. They are permanent and stick around until they are dropped. You can generate a new table with a CREATE TABLE statement (DDL). But you can also create a table using the output from a SQL query. To do this you issue a SELECT INTO myNewTable. This will result in creation of table myNewTable. Its contents will be the resulting values and data type from the SELECT query.
There are also tables you can generate on the fly called derived tables. These are also created from SELECT statements that are enclosed within parentheses in the FROM clause. You need to assign an alias for the derived table. The table only exists while the outer query is executing. In Oracle, they call these inline views. There are some limitations on the type of SQL you can put into a derived table. For example, you cannot have an ORDER BY.
When making changes to data within tables, you can group operations into units called transactions. You can explicitly control when these groups are created and when their changes out made permanent. You start this up with a BEGIN TRANSACTION. Then you issue DML statements such as INSERT, UPDATE, and/or DELETE. When done, you can COMMIT TRANSACITON to make all changes final. Or you can ROLLBACK to undo everything in the transaction.

Joining Tables


Sometimes you just query a single table to retrieve data. But often you are dealing with two or more tables. To connect them, you need to use joins. There are four types of joins:
1.      Inner join
2.      Left outer join
3.      Right outer join
4.      Cross join

The inner join is the default join. Only matches from the left and right side of the join will be produced. The left outer join, or left join, will give you all rows from the left-hand side. It will also include matches from the right-hand side. Similarly, the right outer join gives you all rows from the right-hand side. It will also include matches from the left-hand side.
The cross join produces something known as a Cartesian product. Every row from the left side is matched with every row from the right side. If there are m rows on the left, and n rows on the right, the cross join will result in m x n rows.