ASCII Delimited Text: Why the “Right” Answer Isn’t Very Useful

To overcome some issues with the SQL Server bulk import/export process, I went shopping for a an unquoted, delimited format this morning.  There’s a very old, very simple solution that would have worked fine in my specific circumstance.  Here’s why I’m not going to use it:

The problem I had was that I need to export using the SQL Server bcp utility and re-import using BULK INSERT.  While you can specify the separators with these tools, they don’t handle quoted fields at all, i.e. quoting, if present, doesn’t actually protect the separators should they appear in data.  Per the documentation:

If a terminator character occurs within the data, it is interpreted as a terminator, not as data, and the data after that character is interpreted as belonging to the next field or record. Therefore, choose your terminators carefully to make sure that they never appear in your data.

This was a problem, because my data actually could contain tabs, carriage returns and line feeds, which were the default separators.  It also definitely contained commas.  That puts all the common delimiters out of commission.  “Choose your terminators carefully” is pretty vague, but I figured why not give it a shot?

The approach I almost took was to use the ASCII “record separator” and “unit separator” (ASCII 30 and 31, HEX 0x1E and 0x1F respectively).   Here’s a blog post that nicely sums up the format and why it’s historically and logically valid.

Though it’s not well-documented (somewhat contra-documented), SQL Server supports this.   Even though they suggest that only “printable” characters and a small set of control characters are valid, I had no problem exporting an ASCII-delimited text file using a command like this:

bcp SELECT "x,y FROM t" queryout "c:\filename.asc" -c -r 0x1e -t 0x1f -UTF8 -T -Slocalhost"

I didn’t get as far as trying the BULK INSERT on the other end, though, and here’s why…

Once I had that “ASCII delimited” file, I opened it in NotePad++ to verify that the format was readable and correct.  It was, but the effect wasn’t pretty.  I immediately realized that if I wanted to do anything else with this data–extract a column as text, import into Excel–I was going to have problems.  Excel’s text import wizard, for example, doesn’t support control characters other than tab.  This wasn’t really news to me as I see weird control characters in ASCII all the time working with magnetic stripes, bar codes, RFID, and other encoding and storage mechanisms.  Yes, you can eventually get at the data to a usable form with search and replace, or worst-case regular expressions, but why make it hard to manage if you don’t have to?

In my case, the whitespace control characters in the data improved readability but weren’t functionally required–the data payload itself was code.  Plus, I had comment blocks available as a “quote-like” protection structure.  So, I ended up compromising on replacing the whitespace control characters in a such a way that I can get them back if I want to, or leave them alone for the automated process.  What I ended up doing was this:

bcp SELECT "x,REPLACE(REPLACE(REPLACE(y,CHAR(13),'/*CR*/'),CHAR(10),'/*LF*/'),CHAR(9),'/*TAB*/') FROM t" queryout "c:\filename.tsv" -c -UTF8 -T -Slocalhost"

That produces a “normal” tab-separated file with CRLF as the record separator.  I knew that “x” couldn’t contain those characters, so by replacing them out of “y” I have a file that safely exports and imports while being viewable, editable and importable using normal tools without excessive processing.

I wish we had kept better support for ASCII control characters in our tools as we moved forward with new technologies–it would have been useful to have distinct separators that never appeared in data (until, inevitably, your data payloads themselves contained ASCII-delimited data, at which point you’re back to quoting or replacing or choosing new separators… rinse… repeat).  Of course another solution would have been making the SQL Server utilities just slightly more robust so they could process modern quoted .csv and .tsv formats.  There’s always version.next, right?

Leave a Reply

Your email address will not be published. Required fields are marked *