Users will always find the limits of your application.
I manage a web application where users primarily edit data by importing and exporting Excel files. This allows them to work on huge swaths of data without the limitations of a web-based interface. The workbooks store 1-to-many relationships using comma-delimited lists of integer IDs, and use macros, validation, etc. to maintain referential integrity.
In almost all cases this works perfectly; it’s unusual to have more than 100 relationships of a given type for any entity. But for one relationship there could be thousands, and that’s where we found the limit.
The limit is that a single Excel cell may only hold 32,767 characters. Our IDs have an average of 8-10 characters, plus the comma delimiter overhead, so we max out at around 3,000-4,000 relationships. The exceptions are rare, but they do happen, so I needed a solution.
My first thought was to use hexadecimal to reduce the average number of digits required. While decimal encodes around 3.3 bits per digit, hex gives 4. But that would only improve our situation by around 20%. I feared we would find a new exception that would bust the limit again within a few months. I needed a step change.
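To make the ~20% figure concrete: a d-digit decimal number needs about log(10)/log(16) ≈ 0.83 × d hex digits. Here is a quick illustration in Python (my real code is C#; the sample IDs below are hypothetical, not from the actual data set):

```python
# Hex vs. decimal: how many digits does each base need?
import math

# Length ratio when converting decimal digits to hex digits, ≈ 0.83,
# i.e. only a ~17-20% saving.
ratio = math.log(10) / math.log(16)

sample_ids = [123456789, 4294967295, 9876543210]  # hypothetical IDs
for n in sample_ids:
    print(n, len(str(n)), "decimal digits ->", len(f"{n:x}"), "hex digits")
```

A 10-digit ID drops to 8 hex digits — real, but nowhere near the step change I was after.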
I felt like switching bases was on the right track. If I could get a 20% improvement with hexadecimal, what base would make me comfortable that we’ve addressed the limitation for our users?
Base64 was the obvious next choice. It’s well known and has native support in C# (for the ASP.NET side of things), and at six bits per character it would nearly double my relationship limit. However, the “+” symbol had a special meaning in my data loader, and the letters “E” and “e” could trick Excel into treating an encoded string as a number in certain situations (e.g., “1E5” parses as scientific notation). I could have substituted those characters in my algorithm, but for simplicity, I decided to use only [A-DF-Za-df-z0-9], effectively creating a Base 60 encoder.
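The actual implementation is in C#, but the idea is simple enough to sketch in a few lines of Python (an illustration, not the production code):

```python
# Base 60 encoder over [A-DF-Za-df-z0-9]: Base64's alphanumerics
# minus "E" and "e" (which Excel can misread as scientific notation).
import string

ALPHABET = "".join(
    c for c in string.ascii_uppercase + string.ascii_lowercase + string.digits
    if c not in "Ee"
)  # 60 characters

def encode_base60(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 60)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def decode_base60(s: str) -> int:
    n = 0
    for c in s:
        n = n * 60 + ALPHABET.index(c)
    return n
```

A 10-digit decimal ID comes out at 6 characters, so a cell that held ~3,000 IDs can now hold ~5,000.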
This worked and gave me the headroom I needed for now, but it got me thinking: what if I could pack even more bits per character? At what point would code complexity or performance make it not worth the trouble?
Using additional ASCII characters was out — I’d already used 60, and after discounting control characters and others with special meaning, I wouldn’t see enough improvement to bother. But to go to higher-order characters, I had to know whether Excel’s limit was truly a character limit or a byte limit. Fortunately, this was easy enough to test using REPT() and a Unicode character with a code point above 255. It’s definitely a character limit.
So, I could use any Unicode character I wanted, other than [+*,~-], the characters with special meaning for my data loader. But Unicode itself is a Swiss cheese patchwork of allocated and unallocated code points, and even many allocated characters have poor font support. While users don’t have to understand the values, I didn’t want a bunch of rectangles or invisible characters. Unicode also has scattered code points assigned to modifiers, whitespace, and other poor candidates for a human-readable (if human-incomprehensible) encoding. I wanted to squeeze some real juice from the 16-bit character space (I didn’t test Excel’s limit for characters above 65535), but I didn’t want code that required tacking together a hodgepodge of value ranges.
I went on a little search through Unicode, looking for the longest contiguous range of allocated characters that would show up in a US English setting in Windows without needing special fonts. Not surprisingly, this turned out to be the CJK Unified Ideographs, the block in the range 4E00–9FFF. Within this range, the first 20,950 code points are allocated to Chinese characters, all of which render on Windows by falling back from whatever Western font you use in Excel to the default Chinese font on your version of Windows.
While I could use all 20,950 characters, I decided to use 14-bit encoding, the first 16,384 characters. There’s something satisfying about an even power of 2, and while it shouldn’t matter for my application, the micro-optimizer in me likes the fact that I can use bit-shifting in the conversion process.
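My code is C# and linked below, but the bit-shifting version is easy to sketch in Python (an illustration under the same design: 16,384 symbols starting at U+4E00):

```python
# Base 16384 (14-bit) encoder over the start of the CJK Unified
# Ideographs block. Each output character carries exactly 14 bits,
# so the conversion needs only shifts and masks, no division.
BASE_CP = 0x4E00   # first code point of the CJK Unified Ideographs block
BITS = 14          # 2**14 = 16384 symbols
MASK = (1 << BITS) - 1

def encode_b16384(n: int) -> str:
    if n == 0:
        return chr(BASE_CP)
    out = []
    while n:
        out.append(chr(BASE_CP + (n & MASK)))  # low 14 bits
        n >>= BITS
    return "".join(reversed(out))

def decode_b16384(s: str) -> int:
    n = 0
    for c in s:
        n = (n << BITS) | (ord(c) - BASE_CP)
    return n
```

Two characters cover IDs up to 2^28 (about 268 million), and three cover up to 2^42 — more than enough for any integer ID we issue.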
The final result? My encoding is now down to 1-3 characters per integer, plus the comma, giving my users an upper bound of around 10,000 relationships. This, as Bill Gates famously probably didn’t say about 640KB, should be enough for anyone. I don’t foresee any business case for us that would support more than around 5,000, so the extra headroom is nice, and the conversions are fast enough to be a rounding error when importing and exporting workbooks.
I doubt anyone in their right mind will find a use for it, but for the curious, my code is posted here as a Gist.