Buffered stream reading in Cocoa

Strangely enough, neither Cocoa nor Cocoa Touch offers a built-in way to read a stream of bytes as text, one line at a time. Creating a string from a file with an NSString method such as stringWithContentsOfFile:encoding:error:, and then splitting it into an array of lines with componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet], certainly seems simple and convenient enough. But that approach holds the whole file in memory at once, so with very big text files (hundreds of MB and up) memory and performance become a concern, or even, on the iPhone in particular, a full-blown show-stopper.
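For comparison, the whole-file approach looks roughly like the sketch below. SketchLinesByLoadingWholeFile is a hypothetical helper name, and UTF-8 is an assumed encoding; neither comes from the original project.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: loads the ENTIRE file into one NSString, then splits it.
// Both the full string and the array of lines are in memory at the same time.
static NSArray *SketchLinesByLoadingWholeFile(NSString *path, NSError **error) {
    NSString *contents = [NSString stringWithContentsOfFile:path
                                                   encoding:NSUTF8StringEncoding // assumed
                                                      error:error];
    if (contents == nil) {
        return nil; // read or decoding failure; *error describes it
    }
    // Splits at every single newline character in the set.
    return [contents componentsSeparatedByCharactersInSet:
                         [NSCharacterSet newlineCharacterSet]];
}
```

Because the separator set is matched one character at a time, a DOS-style "\r\n" pair yields an empty string between every two lines, and a trailing newline yields an empty last element.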

Years of Java development have made me quite familiar with the I/O API of the java.io package, and I have come to appreciate the elegance of the decorator pattern for buffered reading and writing. So I decided to write my own decorator for NSInputStream, one that adds text-encoding awareness and line-by-line buffering to the byte stream.

This allows me to parse massive text files without crashing on the iPhone. The need emerged while developing the iDEX application: its update feature downloads a text file from the dexonline server, and that file can become very long. At a mere 7-8 MB my iPhone 3G was already struggling to cope (the update process reads the data and saves it to SQLite, which generates a fair amount of CPU activity and takes up lots of RAM).

So, here’s a short code snippet that illustrates what I wanted to be able to do:

NSInputStream *input = [NSInputStream inputStreamWithFileAtPath:@"filename.txt"];
// check for errors
MDBufferedInputStream *bufstream = [[MDBufferedInputStream alloc] initWithInputStream:input bufferSize:1024 encoding:NSUTF8StringEncoding];
[bufstream open];
NSString *line;
// the loop ends when readLine returns nil
while ((line = [bufstream readLine]) != nil) {
    // process the line
}
[bufstream close];
[bufstream release]; // manual reference counting; omit this line under ARC

Although far from perfect, the MDBufferedInputStream class is complete; I would describe it as being at an advanced release candidate stage. The source code is available on GitHub.

You initialise a new instance of this class with initWithInputStream:bufferSize:encoding:. The buffer size specifies the size of each chunk read from the stream. The readLine method then processes a chunk byte by byte, looking for newline markers; at each marker it creates a new string from the bytes of the line using the specified encoding. The class also has a bytesProcessed property that reports how many bytes have been read from the underlying stream so far.
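The chunk-and-scan loop described above can be sketched roughly as follows. This is not the actual MDBufferedInputStream source: SketchLineReader and its takeLineOfLength:skipping: helper are hypothetical names, the code assumes ARC, the caller is assumed to have opened the stream already, only '\n' is treated as a terminator, and the encoding is assumed to be one such as UTF-8 in which the byte 0x0A never occurs inside a multi-byte character.

```objc
#import <Foundation/Foundation.h>

// Hypothetical sketch of a chunked line reader (assumes ARC).
@interface SketchLineReader : NSObject
@property (nonatomic, readonly) NSUInteger bytesProcessed;
- (instancetype)initWithInputStream:(NSInputStream *)stream
                         bufferSize:(NSUInteger)bufferSize
                           encoding:(NSStringEncoding)encoding;
- (NSString *)readLine; // nil at end of stream
@end

@implementation SketchLineReader {
    NSInputStream *_stream;
    NSStringEncoding _encoding;
    NSMutableData *_chunk;   // scratch space for one read from the stream
    NSMutableData *_pending; // bytes read but not yet returned as a line
    BOOL _atEnd;
}

- (instancetype)initWithInputStream:(NSInputStream *)stream
                         bufferSize:(NSUInteger)bufferSize
                           encoding:(NSStringEncoding)encoding {
    if ((self = [super init])) {
        _stream = stream;
        _encoding = encoding;
        _chunk = [NSMutableData dataWithLength:(bufferSize > 0 ? bufferSize : 1)];
        _pending = [NSMutableData data];
    }
    return self;
}

// Cuts `length` bytes plus `terminator` bytes off the front of the pending
// buffer and decodes the first `length` bytes. Returns nil if the bytes are
// not valid in the chosen encoding (a real class should report that as an
// error rather than let it look like end of stream).
- (NSString *)takeLineOfLength:(NSUInteger)length skipping:(NSUInteger)terminator {
    NSData *lineData = [_pending subdataWithRange:NSMakeRange(0, length)];
    [_pending replaceBytesInRange:NSMakeRange(0, length + terminator)
                        withBytes:NULL
                           length:0];
    return [[NSString alloc] initWithData:lineData encoding:_encoding];
}

- (NSString *)readLine {
    NSUInteger searchFrom = 0; // bytes before this index hold no '\n'
    for (;;) {
        const uint8_t *bytes = _pending.bytes;
        for (NSUInteger i = searchFrom; i < _pending.length; i++) {
            if (bytes[i] == '\n') {
                return [self takeLineOfLength:i skipping:1];
            }
        }
        searchFrom = _pending.length;
        if (_atEnd) {
            // Last line without a trailing newline, or nothing left at all.
            return _pending.length > 0
                ? [self takeLineOfLength:_pending.length skipping:0]
                : nil;
        }
        // No complete line buffered yet: pull the next chunk.
        NSInteger n = [_stream read:_chunk.mutableBytes maxLength:_chunk.length];
        if (n <= 0) {        // 0 = end of stream, negative = read error
            _atEnd = YES;    // this sketch treats both the same way
            continue;
        }
        [_pending appendBytes:_chunk.bytes length:(NSUInteger)n];
        _bytesProcessed += (NSUInteger)n;
    }
}
@end
```

Decoding one complete line at a time, rather than each chunk as it arrives, avoids cutting a multi-byte character in half at a chunk boundary.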

Like any Cocoa NSInputStream, the buffered stream must be opened before you use it and closed when you’re done with it. As an added bonus, opening the MDBufferedInputStream instance silently opens the underlying decorated stream, and closing the decorator closes it for you. If the decorated stream was already open when you opened the buffered stream, however, you retain full control over it and must close it yourself.
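That ownership rule can be sketched as below. SketchStreamOwner and its _openedByDecorator flag are hypothetical names used for illustration, not taken from the MDBufferedInputStream source, and the code assumes ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical sketch: a wrapper that closes the wrapped stream only if
// it was the one that opened it (assumes ARC).
@interface SketchStreamOwner : NSObject
- (instancetype)initWithInputStream:(NSInputStream *)stream;
- (void)open;
- (void)close;
@end

@implementation SketchStreamOwner {
    NSInputStream *_stream;
    BOOL _openedByDecorator; // YES only if -open had to open the stream
}

- (instancetype)initWithInputStream:(NSInputStream *)stream {
    if ((self = [super init])) {
        _stream = stream;
    }
    return self;
}

- (void)open {
    // Open the wrapped stream only if nobody has opened it yet,
    // and remember that we did so.
    if ([_stream streamStatus] == NSStreamStatusNotOpen) {
        [_stream open];
        _openedByDecorator = YES;
    }
}

- (void)close {
    // A stream the caller opened stays under the caller's control.
    if (_openedByDecorator) {
        [_stream close];
        _openedByDecorator = NO;
    }
}
@end
```

With this rule, a caller who opens the stream before wrapping it can keep reading from it after the wrapper has been closed.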

As I suggested earlier, the class does the job, but some issues need to be smoothed out. This is what I feel requires a little more effort:

  • The internal byte parser of MDBufferedInputStream only looks for single ‘\r’ or ‘\n’ characters to separate lines. The file I have been working on only uses Linux-style line endings, so I’m not sure how the class behaves when faced with DOS-style text or, Heaven forbid, Unicode line termination characters (as it stands, these will simply not split the text into lines). The class should ideally be able to handle all of these.
  • I originally suspected a memory leak somewhere in the code, and put my failure to spot it down to my shallow knowledge of NSAutoreleasePool and the autorelease method. Update: there was no memory leak in the decorator code itself. The leak was in other code of mine that used the decorator, which led me to believe the problem was in -readLine.
  • A couple of processing options would be nice to have: whether to return empty lines to the client or skip them silently, the option of selecting a set of characters that function as comment markers, and so on.
  • The semantics of the inherited read:maxLength: method in the decorator class are unresolved. Should it move the read pointer forward, skipping the corresponding bytes and making them unavailable to readLine? Or should calls to read:maxLength: leave the results of readLine unaltered? My code does not address this at all at the moment, so for now you should not invoke read:maxLength: on MDBufferedInputStream. A complete, production-ready class should handle this one way or another (possibly the first).
  • If the decorator stream is used in a multithreaded context, the value of the bytesProcessed property is calculated at a very inconvenient point that will not give a detailed enough indication of the actual number of bytes processed.
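Two of the items above, mixed line endings and the skip options, could be approached along the lines of the sketch below. SketchFindLine and SketchShouldSkipLine are hypothetical helpers, not part of MDBufferedInputStream. The sketch assumes an ASCII-compatible encoding such as UTF-8 and still does not recognise Unicode line terminators such as U+2028.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: looks for the first complete line in `bytes`.
// Treats "\n", "\r" and "\r\n" as terminators. `atEnd` tells it whether more
// bytes may still arrive from the stream. On success returns YES and reports
// the length of the line and of its terminator (0, 1 or 2 bytes).
static BOOL SketchFindLine(const uint8_t *bytes, NSUInteger length, BOOL atEnd,
                           NSUInteger *lineLength, NSUInteger *terminatorLength) {
    for (NSUInteger i = 0; i < length; i++) {
        if (bytes[i] == '\n') {
            *lineLength = i;
            *terminatorLength = 1;
            return YES;
        }
        if (bytes[i] == '\r') {
            if (i + 1 < length) {
                // The next byte is available: "\r\n" counts as ONE terminator.
                *lineLength = i;
                *terminatorLength = (bytes[i + 1] == '\n') ? 2 : 1;
                return YES;
            }
            if (atEnd) {
                *lineLength = i;
                *terminatorLength = 1;
                return YES;
            }
            // '\r' is the last buffered byte and a '\n' may follow in the
            // next chunk, so the caller has to read more before deciding.
            return NO;
        }
    }
    if (atEnd && length > 0) {
        // Final line with no terminator.
        *lineLength = length;
        *terminatorLength = 0;
        return YES;
    }
    return NO;
}

// Hypothetical filter for the "processing options" item: skip empty lines
// on request, and skip lines whose first character is a comment marker.
static BOOL SketchShouldSkipLine(NSString *line, BOOL skipEmpty,
                                 NSCharacterSet *commentMarkers) {
    if (line.length == 0) {
        return skipEmpty;
    }
    return commentMarkers != nil &&
           [commentMarkers characterIsMember:[line characterAtIndex:0]];
}
```

The case that is easy to miss is a '\r' falling on the last byte of a chunk: deciding too early would turn a single "\r\n" into two line breaks.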

If anyone can help, please feel free to submit improvements, suggestions, or comments.

4 Comments »

 
  • Sam says:

    Awesome man. Just what I needed. Why do you think there is a memory leak? I couldn’t spot one….

  • Federico says:

    Hi Sam,

    Actually, there wasn’t a leak! I was using my own class with code that contained leaks and I wrongly believed that the problem was in the decorator. But after sorting out the client code, I have tested the MDBufferedInputStream on 57 MB of text and not a byte was leaked ;-)

  • mememe says:

    It always ends in a Program received signal: “EXC_BAD_ACCESS”.

  • Federico says:

    Can you post a snippet of the code you’re running?

 
