BiobaseFasta-0.4.0.1: streaming FASTA parser
Safe HaskellNone
LanguageHaskell2010

Biobase.Fasta.Streaming

Description

Streaming Fasta handling via the streaming library.

The functions in here should be streaming in constant memory.

A typical, slightly complicated is this: forEach :: forall r . Stream (ByteString m) m r -> m (Stream (Of ()) m r) forEach dna = do -- extract the header, but at most 123 characters, dropping the rest hdr SP.:> dta ← extractHeader (Just 123) dna -- create windows ws of a particular type. Include the prefix, the suffix, and make each window 10 characters long let ws = (streamedWindows True True (Just 10) (SequenceIdentifier hdr) PlusStrand dta :: SP.Stream (SP.Of (BioSequenceWindow DNA DNA 0)) m r) -- count the number of characters in dna, get the return value, print each window count SP.:> r ← SP.mapM_ (liftIO . print) . bswSeqLength $ SP.copy ws liftIO $ print count liftIO $ putStrLn "" -- yield one vacuous () result, return the remainder r from dna. return $ SP.yield () *> return r

TODO Check if this is actually true with some unit tests.

Synopsis

Documentation

streamOfStreamedFasta Source #

Arguments

:: forall m r. Monad m 
=> ByteStream m r 
-> Stream (Stream (ByteStream m) m) m r 

Here each individual fasta file will be a stream.

TODO Once this works, streamingFasta should be S.concats . streamOfStreamedFasta ...

splitFasta :: Monad m => Stream (ByteStream m) m r -> Stream (ByteStream m) m (Stream (ByteStream m) m r) Source #

Given a 'Stream (ByteString m) m r' which is a Stream of lines, split off the first Fasta entry.

collapseData :: Monad m => Stream (ByteStream m) m r -> Stream (ByteStream m) m r Source #

Given a stream, roughly like [BS Header, BS Data1, BS Data2, ...] create a stream like [BS Header, BS Data]. This means that the resulting stream holds exactly two ByteString's.

reChunkBS :: Monad m => Int -> Stream (ByteStream m) m r -> Stream (ByteStream m) m r Source #

Rechunk a stream of bytestrings.

chunksToWindows :: Monad m => SequenceIdentifier w -> Strand -> Stream (ByteStream m) m r -> Stream (Of (Location w FwdPosition (BioSequence ty))) m r Source #

Assuming a "rechunked" stream of bytestrings, create sequence windows.

streamedWindows Source #

Arguments

:: Monad m 
=> Maybe Int 
-> Maybe Int 
-> Maybe Int

desired size or a single huge Fasta entry.

-> SequenceIdentifier w 
-> Strand 
-> Stream (ByteStream m) m r 
-> Stream (Of (PIS w FwdPosition (BioSequence ty))) m r 

Make it possible to take a fasta stream and produce a stream of BioSequenceWindows. This is a convenience function around 'withSuffix . withPrefix . chunksToWindows . reChunks'.

In case of a Nothing window size, a single huge Fasta entry is produced (and materialized!).

TODO In case of Nothing window size, we use the collapseData function which has one check too many, and will be slightly slower. However, the check should be once per ByteString.

streamLocationLength :: (Monad m, ModifyLocation posTy seqTy) => Stream (Of (Location i posTy seqTy)) m r -> m (Of Int r) Source #

Get the full length of a stream of BioSequenceWindows, counted in characters in each bswSequence.

To use, start with bswSeqLength $ SP.copy xs. Then consume this stream normally. It still provides a Stream of BioSequenceWindowss. However, the return type is now not just r, but it provides Int SP.:> r, where the Int provides the total length of characters within this Fasta entry.

This value may then be used to fully update negative strand information.

extractHeader :: Monad m => Maybe Int -> Stream (ByteStream m) m r -> m (Of ByteString (Stream (ByteStream m) m r)) Source #

As a first function, the header should be extracted from a Fasta stream. Since headers may be malformed / malicious, we make it possible to