SQL Server HTML to Plain Text Function

This week, I had a requirement to convert some HTML text in a SQL column to plaintext and store that plaintext in a different database. I found this solution for HTML to text, but it’s using a WHILE loop and declaring variables, and we all know that loops make SQL Server work row-by-row rather than exploiting its data-set processing power, right? (To be fair, it’s a very old post, and it’s likely that my function wasn’t a possibility when it was written.)

So I used the same string functions from that post, made some other improvements (replacing HTML line breaks with text ones, using nvarchar data types instead of varchar), and wrapped it all up in a function. If you want to grab the function and go, here it is. We’ll follow up with an explainer and a test of our function in case there’s confusion about how to put it to use.
Continue reading

Posted in MS SQL, Transact SQL

Validating a Bank Routing Number on a Web Form

Have you ever had to collect a routing number and bank account number on a web form?

I’ve worked on a lot of software where we collected this information (I’ve spent a lot of time on banking systems), but never on a web form. One thing I remember is that the routing number has a check digit, so only about one out of every 10 nine-digit numbers passes as a valid routing number. The routine for calculating that check digit is universal, at least for banks that go through the US Federal Reserve.

Here’s a writeup of how that digit gets calculated, copied from a doc that I wrote in 1999 (I’m so old!).

The program will provide a validation of the ABA route/transit number entered for a bank where ACH deductions are being set up. The following logic is used for this validation:

The 1st, 4th, and 7th digits are multiplied by 3. The 2nd, 5th, and 8th digits are multiplied by 7. The 3rd, 6th, and 9th digits are multiplied by 1.

Sum all of the multiplication results and divide by 10. The remainder must be zero.

Routing Number:  0   5   3   1   0   0   3   0   0
Multipliers:     3   7   1   3   7   1   3   7   1
Products:        0  35   3   3   0   0   9   0   0

0 + 35 + 3 + 3 + 0 + 0 + 9 + 0 + 0 = 50.

Since 50 is divisible by 10, this is an acceptable number.

I’ve now run into a case where we need to validate it on a web form, so I implemented this validation in a JavaScript function. It’s pretty self-explanatory.

So, you just pass the number from your input as the parameter n, and it returns a boolean true if it’s valid, and false if it’s not.
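For readers who want something to start from, here is a minimal sketch of the check-digit logic described above. It is not necessarily the original function: the name isValidRoutingNumber is hypothetical, and only the parameter name n comes from the text. It assumes the value arrives as a string (as it would from a form input), so leading zeros are preserved.

```javascript
// Sketch only: isValidRoutingNumber is a hypothetical name, not the post's original code.
// n is the routing number as typed into the form (a string, so leading zeros survive).
function isValidRoutingNumber(n) {
  const digits = String(n).trim();

  // A routing number must be exactly nine digits.
  if (!/^\d{9}$/.test(digits)) {
    return false;
  }

  // Multipliers 3, 7, 1 repeat across the nine positions.
  const weights = [3, 7, 1, 3, 7, 1, 3, 7, 1];

  let sum = 0;
  for (let i = 0; i < 9; i++) {
    sum += Number(digits[i]) * weights[i];
  }

  // Valid when the weighted sum divides evenly by 10.
  return sum % 10 === 0;
}

console.log(isValidRoutingNumber("053100300")); // the worked example above: sum is 50
```

Note that this only confirms the check digit. It says nothing about whether the number belongs to a real bank, and a string of nine zeros would pass, so you may want to reject that case separately.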

Posted in Uncategorized

SQL Tricks – Remove Interior Spaces

This week, I was given a set of data to import to a database. After the import, I noticed that the source data had lots of unnecessary spaces all over the place. I can, of course, get rid of leading and trailing spaces by applying RTRIM() and LTRIM() to my import, but I also had space padding inside some of the names themselves. For example, there might be four spaces between the first and last names. How it got that way is kind of complicated, and not particularly relevant, but I knew that having such ugly data in my database would bug me.

I figured this was well-trodden turf, but I didn’t like the first couple of Google results I got. They either didn’t keep stripping spaces after the first few, or they were unnecessarily complicated. I was pretty sure that I could implement a recursive CTE to fix my names.

Let’s set up a test. (Hat tip to this random name generator.)

Continue reading

Posted in MS SQL, Transact SQL

Vigenere By SQL

After that previous post, I suppose it was natural for me to wonder what it would take to apply Vigenere cypher logic using T-SQL.

I knew right off that the code would look much different. Mainly, because my self-imposed rule is that I try very, very hard never to use a loop. I’ve written it before, but loops in T-SQL make the server work on your data row by row, which is madness, considering the millions of man-hours that engineers have spent optimizing its ability to work on large data sets in parallel.

I could go deeper into that rant, but you’ve probably heard it before.

Anyway, I stayed up late the other night to see what I could do, and it turned out to be easier than I expected.
Continue reading

Posted in MS SQL, Transact SQL

Coding Vigenere Cyphers

Recently, I rewatched The Prestige, the very good Christopher Nolan movie where Hugh Jackman and Christian Bale play rival magicians in the late 1800s. Jackman’s character has an old journal, written in cypher text that he slowly and meticulously needs to work out over the course of the movie.

I’ve always been fascinated by encryption, and I remembered that he used a Vigenere Cypher, but I had to look up again how those work.

Caesar Cypher

Imagine a cypher where you just switch each letter an agreed-upon number of letters up the alphabet. If your number is 3, then you would substitute D for A, E for B, etc. X would loop around and become A, Y would be B, etc.

This is a Caesar Cypher, reportedly used by Julius himself, but I bet it didn’t take too long to break that code. If you think about it, the cypher that millions solve in the daily paper is an order of magnitude more difficult.
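To make the shifting concrete, here is a small sketch of that substitution. The function names caesarEncode and caesarDecode are hypothetical, and the choice to uppercase the text and pass non-letters through untouched is an assumption for illustration.

```javascript
// Sketch only: hypothetical helper names, not code from the full post.
// Shifts each letter A-Z forward by `shift` places, wrapping past Z back to A.
function caesarEncode(text, shift) {
  const A = "A".charCodeAt(0);
  // Normalize the shift into 0..25 so negative shifts also work.
  const k = ((shift % 26) + 26) % 26;
  return text.toUpperCase().replace(/[A-Z]/g, (ch) =>
    String.fromCharCode(((ch.charCodeAt(0) - A + k) % 26) + A)
  );
}

// Decoding is just encoding with the opposite shift.
function caesarDecode(text, shift) {
  return caesarEncode(text, -shift);
}

console.log(caesarEncode("ABXY", 3)); // "DEAB"
console.log(caesarDecode("DEAB", 3)); // "ABXY"
```

With only 25 useful shifts to try, it is easy to see why this code would not stay secret for long.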

Enter Vigenere
Continue reading

Posted in Javascript, Swift

SQL For Fun: Parsing the iTunes XML

What happens if one part of your personality is data-parsing nerd and another part is music lover nerd?

You end up trying to answer questions like, “Which songs did I have rated with 4 stars last year that are lower than that now?”

Or, “How many songs that were added to my library before this year started have been played this year?”

iTunes makes it easy to create smart playlists that can answer a lot of these questions. But iTunes keeps your current play count and your current rating. You can’t get information about the state of your music library in the past and then run queries to see how things have changed.

It does, however, let you export a snapshot of your library data to an XML format, and if we save that XML with time stamps in a SQL database, we can answer these questions ourselves.
Continue reading

Posted in Apple, iTunes, MS SQL, Transact SQL

Using Tally Table to Remove Invalid XML Characters

A couple of weeks ago, I was called in to troubleshoot an error occurring in a SQL stored procedure. The procedure was selecting a variety of information out as XML data. The error occurred because one of the columns being included was a text data column, which included raw text notes brought into the system from a variety of sources. Some of these sources included characters that are invalid in XML.

Thinking this was a common issue, I went to Google. I found the common solution was something like this:

This code is fine. It works well, and I like using the PATINDEX function to evaluate a string of characters against a regex-like wildcard pattern. (I have a newly discovered appreciation for REGEX. I should post more about that…)

How Do We Improve This?

One thing sat poorly with me about this solution, though. I don’t like the WHILE loop.

Transact-SQL is built to work on big sets of data all at once. When you use a WHILE loop, you’re forcing the database to work on one piece of information at a time.

Over the last several years, I’ve convinced myself that any data manipulation that your programmer brain wants to do with a WHILE loop can be done more quickly using a Tally table. If you aren’t using Tally tables regularly, go ahead and read from this link. No single article has improved my skill set more than this.

So, I wrote an equivalent function to remove my XML characters.

What Are We Doing, Exactly?

The Common Table Expressions t1, t2, t3, t4, t5, and cteTally just build a Tally table, a result set that spits out numbers in order. In this case, I’ve cross-joined enough rows to return numbers from 1 to 4,294,967,296, which is ridiculous overkill.

The CTE named cteEachChar splits our text into a list of individual characters.

In this subquery, N is the column representing the number from cteTally. The column returned as Ch, then, is the Nth character of our test string. You end up with one character on each row if you select directly out of this CTE.

The WHERE clause uses the LEN function so that there’s no need for the server to evaluate rows up to 4,294,967,296. I would think it helps performance when there’s a short string, but I didn’t test it to see for sure. The more important term in the WHERE clause is our same PATINDEX, which removes those rows where the characters are invalid for XML. Mission complete!

Well, almost complete. We still have to merge our single-character rows back into one big line of characters.

That brings us to our last CTE. Here, we use a trick for combining information from different rows into a single result using the FOR XML construct.

I wish I knew whom to credit for this construct. I’ve been using it for a long time to combine different rows’ information into a delimited list. (SELECT ‘, ‘ + Ch [text()]), then use SUBSTRING to get rid of the first comma and space.

My original function puts it all back together again, but without my delimiter, it changes my spaces to “&#x20;”. The final select replaces those with the actual space that I want.

No real reason that I specified the column name next to the name of the CTE in this case rather than in the select statement itself. I normally like having column names in the SELECT, because I find that a little easier to read. But I can be flexible with it. 🙃

How Does This Really Perform?

In the data set where I hit this problem, the old script and my new one made no discernible performance difference. They were both more or less instant. But if I bundle this up as a function, someone somewhere is going to apply it to millions of rows, each potentially having millions of characters of data. So let’s test with something big.

I created a test string, made up of the sentence, “The quick brown fox jumped over the lazy dogs.” I dropped 800 invalid XML characters into the middle of this, and then replicated that big character string a million times. Then, I added a variable for start time, and a DATEDIFF function at the end so we can see how long it took to run. Here are the final queries and results from my underpowered development machine.

(As with all SQL performance tests, much of this is dependent on hardware and on what else our servers are doing, so we can compare two results, but anyone’s specific result on a specific machine may vary widely.)

The results on my underpowered development box:

Now, the full query and the results on the same test data using the Tally Table method:

Repeated tests showed execution times in the same general area.

It seems like a lot of work to knock out 500 milliseconds, but I look at it as reducing the time by two orders of magnitude. Perhaps someday, this method reduces someone’s job from running for 100 hours down to running in one hour.

Posted in MS SQL, REGEX, Transact SQL

Using SQL to Merge Three Incomplete Data Sets

You can join data sets. LEFT JOIN, RIGHT JOIN, even OUTER APPLY. There’s nothing to learn here.

Last week, I had what seemed to be a simple join of three data sets. My permutation was a lot more complicated than this, but I had to boil it down to this example in order to think it through.

What makes it tricky is that none of the three sets of data have ALL of the keys that we need in order to represent our full, inclusive result set.

Let’s look at an example.

Table 1:

Key Animal
A aardvark
B bird
C cat

Table 2:

Key Fruit
B banana
A apple
P pear

Table 3:

Key Activity
A act
C climb
P push
R run

And then we’re looking for a result like this:

Key  Animal    Fruit   Action
A    aardvark  apple   act
B    bird      banana  NULL
C    cat       NULL    climb
P    NULL      pear    push
R    NULL      NULL    run

Let’s create our data and figure it out.

We’re going to need a full outer join between the three tables, so let’s start with that.

The key that makes this tricky is in that 2nd outer join. Because our key column might or might not be in either of our first two tables, we have to tell the server to try the join to t1, but also tell it to look at t2 if the key isn’t there in t1.

If we were to go to four tables, we would have to catch all of the permutations in order to be sure our join would work correctly. It would quickly get difficult to read, and I suspect, difficult for the server to perform efficiently.

We can use the COALESCE function to simplify.

With this syntax, we can just keep adding columns on there as we add tables. It’s not exactly telling the server to do the same thing. It’s saying that if t1.col1 matches, join there. Otherwise, move along to t2.col1. But since our keys are common in these sets, the result is exactly what we want.

Speaking of results, we’re not quite there yet.

col1  animal    col1  fruit   col1  activity
A     aardvark  A     apple   A     act
B     bird      B     banana  NULL  NULL
C     cat       NULL  NULL    C     climb
NULL  NULL      P     pear    P     push
NULL  NULL      NULL  NULL    R     run

Let’s use COALESCE again and put this one to bed.

Bingo!

Key  Animal    Fruit   Action
A    aardvark  apple   act
B    bird      banana  NULL
C    cat       NULL    climb
P    NULL      pear    push
R    NULL      NULL    run

Posted in MS SQL, Transact SQL

Find DTS and FTP Job Steps From msdb

I started a new full-time contract this month, which is always weird. Not having a full understanding of the project, access rights roadblocks, and unfamiliarity with the code base all team up to make days feel long and unproductive.

One of the things I needed to do is to find where data is regularly getting imported into a database. I came up with this query on msdb to show me scheduled tasks that run FTP commands or DTS packages and had been run during this calendar year.

 
The items in the WHERE clause are pretty much self-explanatory and can be changed to meet the need of any SQL detective. It’s one I’ll keep in my toolbox for a while, I suspect.

Aside: what’s with the 1 = 1 at the top of the WHERE clause?

If I’m doing a lot of commenting and uncommenting of conditions, I like to just start my WHEREs with 1 = 1. That way, I don’t have to worry about whether or not I need to include the AND when I comment something. I can just start dropping “--” at the start of each line and not think about it any more than that.

Posted in MS SQL

Logical Tests Using OR in Crystal Reports

Consider these two formulas in Crystal Reports:

One would think that these two should behave the same, right?

In fact, the results are different if Table.Value is null.

In the first case, isnull({Table.Value}) gets evaluated first. Since that’s true, and the other condition is connected with an OR, the system (apparently) doesn’t evaluate the other condition. You get “Result 1”, which is what I would want and expect.

In the second case, tonumber({Table.Value}) gets evaluated first. The tonumber() function on a null value returns neither true nor false. That makes sense, I guess, but you can’t evaluate a null in an OR and get a true or false either. So for this formula, neither result is presented.

Admittedly, I hit this using an embarrassingly old version of Crystal Reports. I’ll have to test to see if the same thing happens on something more modern.

But for now, the lesson is to put isnull() parts at the beginning of logical tests.

Posted in Uncategorized