| Safe Haskell | None |
|---|---|
| Language | Haskell98 |
Network.HTTP.Conduit.Downloader
Description
HTTP downloader tailored for web-crawler needs.
- Handles all possible http-conduit exceptions and returns human readable error messages.
- Handles some web server bugs (returning
deflatedata instead ofgzip, invalidgzipencoding). - Uses OpenSSL instead of
tlspackage (sincetlsdoesn't handle all sites). - Ignores invalid SSL sertificates.
- Receives data in 32k chunks internally to reduce memory fragmentation on many parallel downloads.
- Download timeout.
- Total download size limit.
- Returns HTTP headers for subsequent redownloads
and handles
Not modifiedresults. - Can be used with external DNS resolver (hsdns-cache for example).
- Keep-alive connections pool (thanks to http-conduit).
Typical workflow in crawler:
withDnsCache $ c -> withDownloader $ d -> do
... -- got URL from queue
ra <- resolveA c $ hostNameFromUrl url
case ra of
Left err -> ... -- uh oh, bad host
Right ha -> do
... -- crawler politeness stuff (rate limits, queues)
dr <- download d url (Just ha) opts
case dr of
DROK dat redownloadOptions ->
... -- analyze data, save redownloadOpts for next download
DRRedirect .. -> ...
DRNotModified -> ...
DRError e -> ...
It's highly recommended to use
http://hackage.haskell.org/package/concurrent-dns-cache
(preferably with single resolver pointing to locally running BIND)
for DNS resolution since getAddrInfo used in http-conduit can be
buggy and ineffective when it needs to resolve many hosts per second for
a long time.
- urlGetContents :: String -> IO ByteString
- urlGetContentsPost :: String -> ByteString -> IO ByteString
- download :: Downloader -> String -> Maybe HostAddress -> DownloadOptions -> IO DownloadResult
- post :: Downloader -> String -> Maybe HostAddress -> ByteString -> IO DownloadResult
- downloadG :: (Request -> ResourceT IO Request) -> Downloader -> String -> Maybe HostAddress -> DownloadOptions -> IO DownloadResult
- rawDownload :: (Request -> ResourceT IO Request) -> Downloader -> String -> Maybe HostAddress -> DownloadOptions -> IO (DownloadResult, Maybe RawDownloadResult)
- data DownloadResult
- data RawDownloadResult = RawDownloadResult {}
- type DownloadOptions = [String]
- data DownloaderSettings = DownloaderSettings {}
- data Downloader
- withDownloader :: (Downloader -> IO a) -> IO a
- withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a
- newDownloader :: DownloaderSettings -> IO Downloader
- postRequest :: ByteString -> Request -> Request
- sinkByteString :: MonadIO m => Int -> Sink ByteString m (Maybe ByteString)
Download operations
urlGetContents :: String -> IO ByteString Source
Download single URL with default DownloaderSettings.
Fails if result is not DROK.
urlGetContentsPost :: String -> ByteString -> IO ByteString Source
Post data and download single URL with default DownloaderSettings.
Fails if result is not DROK.
Arguments
| :: Downloader | |
| -> String | URL |
| -> Maybe HostAddress | Optional resolved |
| -> DownloadOptions | |
| -> IO DownloadResult |
Perform download
post :: Downloader -> String -> Maybe HostAddress -> ByteString -> IO DownloadResult Source
Perform HTTP POST.
Arguments
| :: (Request -> ResourceT IO Request) | Function to modify |
| -> Downloader | |
| -> String | URL |
| -> Maybe HostAddress | Optional resolved |
| -> DownloadOptions | |
| -> IO DownloadResult |
Generic version of download
with ability to modify http-conduit Request.
Arguments
| :: (Request -> ResourceT IO Request) | Function to modify |
| -> Downloader | |
| -> String | URL |
| -> Maybe HostAddress | Optional resolved |
| -> DownloadOptions | |
| -> IO (DownloadResult, Maybe RawDownloadResult) |
Even more generic version of download, which returns RawDownloadResult.
RawDownloadResult is optional since it can not be determined on timeouts
and connection errors.
data DownloadResult Source
Result of download operation.
Constructors
| DROK ByteString DownloadOptions | Successful download with data and options for next download. |
| DRRedirect String | Redirect URL |
| DRError String | Error |
| DRNotModified | HTTP 304 Not Modified |
Instances
data RawDownloadResult Source
Result of rawDownload operation.
Constructors
| RawDownloadResult | |
Fields | |
Instances
type DownloadOptions = [String] Source
If-None-Match and/or If-Modified-Since headers.
Downloader
data DownloaderSettings Source
Settings used in downloader.
Constructors
| DownloaderSettings | |
Fields
| |
Instances
data Downloader Source
Keeps http-conduit Manager and DownloaderSettings.
withDownloader :: (Downloader -> IO a) -> IO a Source
Create a new Downloader, use it in the provided function,
and then release it.
withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a Source
Create a new Downloader with provided settings,
use it in the provided function, and then release it.
newDownloader :: DownloaderSettings -> IO Downloader Source
Create a Downloader with settings.
Utils
postRequest :: ByteString -> Request -> Request Source
Make HTTP POST request.
sinkByteString :: MonadIO m => Int -> Sink ByteString m (Maybe ByteString) Source
Sink data using 32k buffers to reduce memory fragmentation.
Returns Nothing if downloaded too much data.