Safe Haskell | None |
---|---|
Language | Haskell2010 |
Streaming Fasta handling via the streaming
library.
The functions in here should be streaming in constant memory.
A typical, slightly complicated is this:
forEach :: forall r . Stream (ByteString m) m r -> m (Stream (Of ()) m r)
forEach dna = do
-- extract the header, but at most 123 characters, dropping the rest
hdr SP.:> dta ← extractHeader (Just 123) dna
-- create windows
ws of a particular type. Include the prefix, the suffix, and make each window 10 characters long
let ws = (streamedWindows True True (Just 10) (SequenceIdentifier hdr) PlusStrand dta :: SP.Stream (SP.Of (BioSequenceWindow DNA DNA 0)) m r)
-- count the number of characters in
dna, get the return value, print each window
count SP.:> r ← SP.mapM_ (liftIO . print) . bswSeqLength $ SP.copy ws
liftIO $ print count
liftIO $ putStrLn ""
-- yield one vacuous
() result, return the remainder
r from dna.
return $ SP.yield () *> return r
TODO Check if this is actually true with some unit tests.
Synopsis
- streamedFasta :: Monad m => ByteStream m r -> Stream (Stream (ByteStream m) m) m r
- streamOfStreamedFasta :: forall m r. Monad m => ByteStream m r -> Stream (Stream (ByteStream m) m) m r
- splitFasta :: Monad m => Stream (ByteStream m) m r -> Stream (ByteStream m) m (Stream (ByteStream m) m r)
- collapseData :: Monad m => Stream (ByteStream m) m r -> Stream (ByteStream m) m r
- reChunkBS :: Monad m => Int -> Stream (ByteStream m) m r -> Stream (ByteStream m) m r
- chunksToWindows :: Monad m => SequenceIdentifier w -> Strand -> Stream (ByteStream m) m r -> Stream (Of (Location w FwdPosition (BioSequence ty))) m r
- streamedWindows :: Monad m => Maybe Int -> Maybe Int -> Maybe Int -> SequenceIdentifier w -> Strand -> Stream (ByteStream m) m r -> Stream (Of (PIS w FwdPosition (BioSequence ty))) m r
- streamLocationLength :: (Monad m, ModifyLocation posTy seqTy) => Stream (Of (Location i posTy seqTy)) m r -> m (Of Int r)
- extractHeader :: Monad m => Maybe Int -> Stream (ByteStream m) m r -> m (Of ByteString (Stream (ByteStream m) m r))
Documentation
streamedFasta :: Monad m => ByteStream m r -> Stream (Stream (ByteStream m) m) m r Source #
streamOfStreamedFasta Source #
:: forall m r. Monad m | |
=> ByteStream m r | |
-> Stream (Stream (ByteStream m) m) m r |
Here each individual fasta file will be a stream.
TODO Once this works, streamingFasta
should be S.concats . streamOfStreamedFasta
...
splitFasta :: Monad m => Stream (ByteStream m) m r -> Stream (ByteStream m) m (Stream (ByteStream m) m r) Source #
Given a 'Stream (ByteString m) m r' which is a Stream
of lines
, split
off the first Fasta
entry.
collapseData :: Monad m => Stream (ByteStream m) m r -> Stream (ByteStream m) m r Source #
reChunkBS :: Monad m => Int -> Stream (ByteStream m) m r -> Stream (ByteStream m) m r Source #
Rechunk a stream of bytestrings.
chunksToWindows :: Monad m => SequenceIdentifier w -> Strand -> Stream (ByteStream m) m r -> Stream (Of (Location w FwdPosition (BioSequence ty))) m r Source #
Assuming a "rechunked" stream of bytestrings, create sequence windows.
:: Monad m | |
=> Maybe Int | |
-> Maybe Int | |
-> Maybe Int | desired size or a single huge |
-> SequenceIdentifier w | |
-> Strand | |
-> Stream (ByteStream m) m r | |
-> Stream (Of (PIS w FwdPosition (BioSequence ty))) m r |
Make it possible to take a fasta stream and produce a stream of
BioSequenceWindow
s. This is a convenience function around
'withSuffix . withPrefix . chunksToWindows . reChunks'.
In case of a Nothing
window size, a single huge Fasta
entry is produced
(and materialized!).
TODO In case of Nothing
window size, we use the collapseData
function
which has one check too many, and will be slightly slower. However, the
check should be once per ByteString
.
streamLocationLength :: (Monad m, ModifyLocation posTy seqTy) => Stream (Of (Location i posTy seqTy)) m r -> m (Of Int r) Source #
Get the full length of a stream of BioSequenceWindow
s, counted in
characters in each bswSequence
.
To use, start with bswSeqLength $ SP.copy xs
. Then consume this stream
normally. It still provides a Stream
of BioSequenceWindows
s. However,
the return type is now not just r
, but it provides Int SP.:> r
, where
the Int
provides the total length of characters within this Fasta
entry.
This value may then be used to fully update negative strand information.
extractHeader :: Monad m => Maybe Int -> Stream (ByteStream m) m r -> m (Of ByteString (Stream (ByteStream m) m r)) Source #
As a first function, the header should be extracted from a Fasta
stream. Since headers may be
malformed / malicious, we make it possible to