Tag Archives: code

ASCII Delimited Text: Why the “Right” Answer Isn’t Very Useful

To overcome some issues with the SQL Server bulk import/export process, I went shopping for an unquoted, delimited format this morning. There’s a very old, very simple solution that would have worked fine in my specific circumstance. Here’s why I’m not going to use it:

The problem I had was that I needed to export using the SQL Server bcp utility and re-import using BULK INSERT. While you can specify the separators with these tools, they don’t handle quoted fields at all; that is, quoting, if present, doesn’t actually protect the separators should they appear in the data. Per the documentation:

If a terminator character occurs within the data, it is interpreted as a terminator, not as data, and the data after that character is interpreted as belonging to the next field or record. Therefore, choose your terminators carefully to make sure that they never appear in your data.

This was a problem, because my data actually could contain tabs, carriage returns and line feeds, which were the default separators.  It also definitely contained commas.  That puts all the common delimiters out of commission.  “Choose your terminators carefully” is pretty vague, but I figured why not give it a shot?
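One way to make “choose your terminators carefully” concrete is to scan the data for each candidate before settling on one. This is only a sketch: the helper name `unused_terminators` and the sample values are made up for illustration, and in practice you would run the check over the real column contents.

```python
# Hypothetical helper: report which candidate terminators never occur in the data.
def unused_terminators(values, candidates):
    """Return the candidates that appear in none of the given string values."""
    return [c for c in candidates if not any(c in v for v in values)]


# Made-up sample values standing in for a column that holds code.
sample = ["SELECT a,\tb\r\nFROM t", "x = 1, y = 2"]
candidates = ["\t", ",", "\r", "\n", "|", "\x1e", "\x1f"]

# Show the survivors as hex codes, since some of them are not printable.
print([hex(ord(c)) for c in unused_terminators(sample, candidates)])
```

With this sample, tab, comma, CR and LF are all ruled out, which is the situation described above.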

The approach I almost took was to use the ASCII “record separator” and “unit separator” (ASCII 30 and 31, HEX 0x1E and 0x1F respectively).   Here’s a blog post that nicely sums up the format and why it’s historically and logically valid.
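For reference, here is roughly what reading and writing that format looks like. It is a minimal sketch, assuming every record (including the last) ends with the record separator; the function names are my own, not part of any library.

```python
RS = "\x1e"  # ASCII 30, record separator
US = "\x1f"  # ASCII 31, unit separator


def dump_ascii_delimited(rows):
    """Join fields with US and end every record with RS. No quoting is involved."""
    for row in rows:
        for field in row:
            if RS in field or US in field:
                raise ValueError("field contains a separator character")
    return "".join(US.join(row) + RS for row in rows)


def load_ascii_delimited(text):
    """Split on RS, then on US. The trailing RS leaves one empty chunk, which is dropped."""
    records = text.split(RS)
    if records and records[-1] == "":
        records.pop()
    return [record.split(US) for record in records]


# Tabs, commas, CR and LF inside a field survive the round trip untouched.
rows = [["1", "line one\r\nline two, with\ta tab"], ["2", "plain"]]
assert load_ascii_delimited(dump_ascii_delimited(rows)) == rows
```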

Though it’s not well-documented (somewhat contra-documented), SQL Server supports this. Even though the documentation suggests that only “printable” characters and a small set of control characters are valid, I had no problem exporting an ASCII-delimited text file using a command like this:

bcp "SELECT x,y FROM t" queryout "c:\filename.asc" -c -r 0x1e -t 0x1f -UTF8 -T -Slocalhost

I didn’t get as far as trying the BULK INSERT on the other end, though, and here’s why…

Once I had that “ASCII delimited” file, I opened it in Notepad++ to verify that the format was readable and correct. It was, but the effect wasn’t pretty. I immediately realized that if I wanted to do anything else with this data (extract a column as text, import into Excel), I was going to have problems. Excel’s text import wizard, for example, doesn’t support control characters other than tab. This wasn’t really news to me, as I see weird control characters in ASCII all the time working with magnetic stripes, bar codes, RFID, and other encoding and storage mechanisms. Yes, you can eventually get the data into a usable form with search and replace or, worst case, regular expressions, but why make it hard to manage if you don’t have to?
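To give a sense of the extra processing involved, here is a sketch of converting an ASCII-delimited string into ordinary quoted CSV with Python’s standard `csv` module. The function name is hypothetical, and it assumes the input ends each record with the record separator.

```python
import csv
import io

RS = "\x1e"  # record separator
US = "\x1f"  # unit separator


def ascii_delimited_to_csv(text):
    """Re-emit RS/US-delimited text as quoted CSV that ordinary tools can open."""
    if text.endswith(RS):
        text = text[:-1]  # drop the terminator after the last record
    if not text:
        return ""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\r\n")
    for record in text.split(RS):
        # csv.writer quotes any field containing commas, quotes, CR or LF.
        writer.writerow(record.split(US))
    return out.getvalue()


print(ascii_delimited_to_csv("1\x1fa,b\x1e2\x1fc\x1e"))
```

It works, but it is one more step between the export and anything a person would want to open.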

In my case, the whitespace control characters in the data improved readability but weren’t functionally required, because the data payload itself was code. Plus, I had comment blocks available as a “quote-like” protection structure. So, I ended up compromising on replacing the whitespace control characters in such a way that I can get them back if I want to, or leave them alone for the automated process. What I ended up doing was this:

bcp "SELECT x,REPLACE(REPLACE(REPLACE(y,CHAR(13),'/*CR*/'),CHAR(10),'/*LF*/'),CHAR(9),'/*TAB*/') FROM t" queryout "c:\filename.tsv" -c -UTF8 -T -Slocalhost

That produces a “normal” tab-separated file with CRLF as the record separator.  I knew that “x” couldn’t contain those characters, so by replacing them out of “y” I have a file that safely exports and imports while being viewable, editable and importable using normal tools without excessive processing.
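The substitution and its reverse are easy to sketch outside SQL. The function names here are mine, and the round trip is only lossless if the original text never contains the literal tokens `/*CR*/`, `/*LF*/` or `/*TAB*/` to begin with.

```python
# Same order as the nested REPLACE calls: CR, then LF, then TAB.
TOKENS = [("\r", "/*CR*/"), ("\n", "/*LF*/"), ("\t", "/*TAB*/")]


def protect(value):
    """Swap whitespace control characters for comment-style placeholders."""
    for char, token in TOKENS:
        value = value.replace(char, token)
    return value


def restore(value):
    """Put the original control characters back."""
    for char, token in TOKENS:
        value = value.replace(token, char)
    return value


code = "SELECT 1\r\n\tFROM t"
assert protect(code) == "SELECT 1/*CR*//*LF*//*TAB*/FROM t"
assert restore(protect(code)) == code
```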

I wish we had kept better support for ASCII control characters in our tools as we moved forward with new technologies. It would have been useful to have distinct separators that never appeared in data (until, inevitably, your data payloads themselves contained ASCII-delimited data, at which point you’re back to quoting or replacing or choosing new separators… rinse… repeat). Of course, another solution would have been making the SQL Server utilities just slightly more robust so they could process modern quoted .csv and .tsv formats. There’s always version.next, right?

Invalid FORMATETC structure (Exception from HRESULT: 0x80040064 (DV_E_FORMATETC)) on Simple Windows Forms Drag-and-Drop Implementation

When creating a Windows Forms application in Visual Studio and implementing drag-and-drop you may encounter this exception during debugging if you drag outside your own application, even though the drag and drop operations complete successfully inside your application.

This may not be a bug, or even an error!  Things to try:

  1. Run the application executable directly from the debug folder, not in Visual Studio debug mode.
  2. Run in VS debug mode with your application directly over a window that can accept its drag content type.  Perform the drag directly from your application to the valid drop zone on the other application, passing over no other applications (including the desktop) or invalid drop zones.  For example, if you are using DataObject.SetText, place your application directly over the text area of a text editor that accepts dragging, e.g. WordPad.

In those two conditions, you probably won’t see the exception thrown and the drag-and-drop will succeed.

I think what’s happening here is that Visual Studio is being a little too aggressive in watching the Windows event messaging system. Because of the way dragging works, these “first chance” COM exceptions will occur as the dragging mouse passes over targets that cannot accept the content stored in the DataObject. If they have any drag-drop awareness, they will attempt a COM native “get” operation on the FORMATETC structure created when you initiated the drag in your application (you used the DataObject wrapper and injected it with DoDragDrop, but this is what you did in effect). If the format doesn’t match any of the formats the dragged-over applications (including Windows Explorer, a.k.a. the desktop) accept, this exception is thrown. It is then, typically, handled either by that application or Windows itself, for example by switching to the “um, not so much” cursor (circle with line through it). Running outside of debug mode, or between two applications that agree on format, it “just works” (dropping successfully in places it can, blocking the drop in places it can’t). In debug mode, VS is telling you, “hey, look, an exception, you could handle this if you wanted to,” but in most cases you’ll just let the OS or other applications handle it.

Bottom line: don’t sweat this one too much if your application runs okay outside the debugger and in the controlled debugging conditions described above.

Here’s an article that is the closest thing I could find to an official explanation.

Return a Record for Each Date Between Two Dates in SQL Server >= 2005

Blogging this so I don’t forget it…

It used to require some fairly ugly, resource-intensive hacks (cursors, temp tables, etc.) to emit an inclusive list between two data points when the source data might not include an entry for every point (for example, a calendar, where not every day contains an event). In SQL Server 2005 and above, this is trivially easy with a recursive Common Table Expression (CTE). To emit one record for every date between 1/1/2008 and 1/31/2008, you do this:


WITH datecte(anydate) AS (
    SELECT CAST('1/1/2008' AS datetime) AS anydate
    UNION ALL
    SELECT anydate + 1 AS anydate
    FROM datecte
    WHERE anydate < CAST('2/1/2008' AS datetime) - 1
)
SELECT anydate
FROM datecte

If you need more than about 100 days (the default recursion limit is 100 levels), add this to the end:

OPTION (MAXRECURSION 1000)

The fact that they stop recursion short at 100 by default would seem to indicate that this is an expensive procedure, but even if you're just using this to produce a dummy table with all the dates for several years, it's a nice shortcut.
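To reason about how large the MAXRECURSION value needs to be, it can help to mirror the query’s logic outside SQL. The sketch below is a Python stand-in, not what SQL Server does internally: the first yielded row plays the part of the anchor SELECT, and each loop iteration plays the part of one pass through the recursive SELECT. Both function names are hypothetical.

```python
from datetime import date, timedelta


def date_rows(first, last):
    """Yield every date from first to last inclusive, one row at a time."""
    current = first              # anchor member: the starting date
    yield current
    while current < last:        # same test as the recursive member's WHERE clause
        current += timedelta(days=1)
        yield current


def recursion_levels(first, last):
    """Recursive steps needed: one per row after the anchor row."""
    return (last - first).days


january = list(date_rows(date(2008, 1, 1), date(2008, 1, 31)))
print(len(january), recursion_levels(date(2008, 1, 1), date(2008, 1, 31)))  # 31 30
```

January fits comfortably under the default limit; a range spanning several months does not, which is where the OPTION clause comes in.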

I just tried the following query, which emits a record for every day between 1/1/2000 and 12/31/2020:


WITH datecte(anydate) AS (
    SELECT CAST('1/1/2000' AS datetime) AS anydate
    UNION ALL
    SELECT anydate + 1 AS anydate
    FROM datecte
    WHERE anydate < CAST('1/1/2021' AS datetime) - 1
)
SELECT anydate
FROM datecte
OPTION (MAXRECURSION 10000)

On my P4-641+ the script emits 7671 records in 0 (that's zero) seconds and "spikes" the processor to all of 3%. Granted this is not a complex query, but at least we know the recursion (if it really is recursion internally, which I doubt) isn't expensive by itself.
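The row count is easy to sanity-check with plain date arithmetic. This is just a quick check in Python, independent of SQL Server:

```python
from datetime import date

# Inclusive number of days from 1/1/2000 through 12/31/2020.
rows = (date(2020, 12, 31) - date(2000, 1, 1)).days + 1
print(rows)  # 7671
```

That is 21 years of 365 days plus six leap days, and it sits well within the MAXRECURSION value of 10000 used in the query.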