
Tuesday, May 27, 2008

Server Resources

So I did end up dividing the large text file into 26*26*26 blocks of files based on the first three characters, and it increased the speed to about 2 KB/second over a four-hour period. As before, the first hour was very fast, presumably because of the lack of corresponding unsubscribes in the unsubscribe list.


The system resources consumed, which were initially minimal, increased to a huge number as shown below, and I think this indicates that the file connections are not being closed properly, even though I use the StreamWriter.Dispose() method to close all open file streams.


In addition, the RAM consumed is constantly growing, which worries me for the sake of the program and whether it is eventually going to crash.


The memory usage is now at 822,000 KB.


.....
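To rule out leaked handles, the pattern I want to move towards is wrapping every stream in a using block so that Dispose() runs even if an exception cuts a bucket short. The class, method, and file names below are only illustrative, not the actual program:

using System.IO;

class BucketProcessor
{
    // Process one prefix bucket at a time. The using blocks guarantee that
    // Dispose() runs even if an exception is thrown part-way through, so no
    // reader or writer handles should survive past the end of a bucket.
    static void ProcessBucket(string emailBucket, string unsubBucket, string outputFile)
    {
        using (StreamReader emails = new StreamReader(emailBucket))
        using (StreamReader unsubs = new StreamReader(unsubBucket))
        using (StreamWriter result = new StreamWriter(outputFile))
        {
            // ... per-bucket scrub logic goes here ...
        }
        // All three handles are released here, before the next bucket is opened.
    }
}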


Monday, May 26, 2008

Split Unsubscribe List




Due to the high performance overhead required by SQL Server 2005, I instead switched to a direct text-to-text unsubscribe operation.

Initially, I split the 65 million entry unsubscribe list into 26 different parts based on the starting character (a-z), eventually making it 27 parts for emails starting with digits.

Then I wrote a small program to test the effectiveness of searching through the entire list, using this as our primary indexing system. The efficiency was not as good as I had hoped for, with a worst-case complexity of about 65 million * 9 million / 20 comparisons, still on the order of 10^13.

The second step was to increase these 26 different parts to 676, based on the first two characters, and eventually to the first three characters, as sketched below.
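As a rough sketch of the splitting step (the bucket layout, directory, and file names here are illustrative, not the exact ones in my program), the idea is to stream the unsubscribe file once and append each address to a bucket file keyed by its first two characters, with anything not starting with two letters going into a shared digits bucket:

using System;
using System.Collections.Generic;
using System.IO;

class UnsubscribeSplitter
{
    // Split the unsubscribe list into bucket files keyed by the first two
    // characters of each address. Addresses that do not start with two
    // letters (for example, digits) go into one shared "0" bucket,
    // mirroring the extra 27th part described above.
    static void SplitByPrefix(string unsubscribeFile, string outputDir)
    {
        var writers = new Dictionary<string, StreamWriter>();
        try
        {
            using (StreamReader reader = new StreamReader(unsubscribeFile))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    string email = line.Trim().ToLowerInvariant();
                    if (email.Length < 2) continue;

                    string key = Char.IsLetter(email[0]) && Char.IsLetter(email[1])
                        ? email.Substring(0, 2)
                        : "0";   // shared bucket for addresses starting with digits

                    StreamWriter writer;
                    if (!writers.TryGetValue(key, out writer))
                    {
                        writer = new StreamWriter(Path.Combine(outputDir, key + ".txt"));
                        writers[key] = writer;
                    }
                    writer.WriteLine(email);
                }
            }
        }
        finally
        {
            // Close every bucket writer even if something fails mid-file.
            foreach (StreamWriter w in writers.Values)
                w.Dispose();
        }
    }
}

Keeping at most 676 writers open at once is manageable; it is the jump to 17,576 files that starts to strain the file system, as described below.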






At this point this would involve 17,576 divisions of the unsubscribe list, running into a possible conflict with the operating system's file management, which complained about so many new files being created in a short period of time. It crashed my computer.

Further work is continuing on this, but so far it seems to be the most effective way.

One thing I noticed was that in the second step, with the indexing key being the first two characters, the program was processing almost 1 megabyte of data per minute (versus a kilobyte) for the first 2 MB. Further analysis shows that this was for the emails starting with 0-9 (digits), which had fewer corresponding unsubscribes in the unsubscribe list.

This leads me to conclude that the biggest performance strain in this algorithm is looping through the individual files to check if a record exists, and that splitting into 17,576 files should give excellent results.

Project Scope

Improving the Big(O) efficiency of processing a large text file (>1 GB) involves streamlining input/output operations and reducing overheads (avoiding databases), while still being able to perform normal data processing functions.

Problem #1 Definition: Text File (A) contains a list of 9 million email addresses, and Text File (B) contains a list of 65 million email addresses.
Every Email that exists in Text File A must be scrubbed against Text File B, following this algorithm.

For Each Email in A
    If Email exists in B
        Remove Email from A
    Else
        Keep Email in A
Next
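As a rough sketch of how this scrub could look once both files have been split into matching buckets (assuming each bucket of B fits comfortably in the 2 GB of RAM; the HashSet lookup is an alternative to the file-loop approach described above, and the names and paths are illustrative):

using System.Collections.Generic;
using System.IO;

class EmailScrubber
{
    // Scrub one bucket of A against the matching bucket of B.
    // The B bucket is loaded into a HashSet, turning the "exists in B"
    // check into an O(1) lookup instead of a loop over the bucket file.
    static void ScrubBucket(string bucketA, string bucketB, string outputFile)
    {
        var unsubscribed = new HashSet<string>();
        using (StreamReader reader = new StreamReader(bucketB))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                unsubscribed.Add(line.Trim().ToLowerInvariant());
        }

        using (StreamReader reader = new StreamReader(bucketA))
        using (StreamWriter writer = new StreamWriter(outputFile))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string email = line.Trim().ToLowerInvariant();
                if (!unsubscribed.Contains(email))   // keep only emails not in B
                    writer.WriteLine(email);
            }
        }
    }
}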

Language of Choice: C, C# or Python.

Maximum Available Processing Time: 24 Hours
Maximum Available System Speed: 4 GHz
Maximum Available RAM: 2 GB