Monday, May 26, 2008

Split Unsubscribe List




Due to the high performance overhead required in SQL Server 2005, I instead switched to using a direct Text to Text unscubscribe operation.

Initially, I split the 65 million unsubscribe list into 26 differnet parts, based on starting Character (a-z) eventually making it 27 for emails starting with digits.

Then I wrote a small program to test the effectiveness of searching through the entire list, using this as our primary indexing system. The effeciency was not as good, as I hoped for with a Big(O) worst case complexity of about 65 Million By 9 Million / approximately 20

The second step was to increase these 26 different parts to 676, based on the first two characters and eventually three characters.






At this point this would involve 17576 division of the unsubscribe list, running into possible conflict with the operationn system's file management system, complaining about so many new files created in a short period of time. It crashed my computer.

Further work continuing..on this, but this seems to be so far the most effective way.

One thing I noticed was that in the second Step, with indexing Key = first two characters, the program was processing almost 1 Megabyte of data in a minute (versus a kilobye) for the first 2 MB. Further analysis shows that this was for the emails starting with 0-9 (or digits) that had fewer corresponding unsubscribes in the unscubscribe list.

This leads me to conclude that the biggest performance strain in this algorithm is the looping through the individual files to check if a record exists, and splitting into 17576 files should get excellent results.

No comments: