Safe Haskell: None
HTTP downloader tailored for web-crawler needs.
- Handles all possible http-conduit exceptions and returns human-readable error messages.
- Handles some web server bugs (returning deflate data instead of gzip).
- Ignores invalid SSL certificates.
- Receives data in 32k blocks internally to reduce memory fragmentation on many parallel downloads.
- Download timeout.
- Total download size limit.
- Returns HTTP headers for subsequent redownloads and handles Not modified results.
- Can be used with external DNS resolver (hsdns-cache for example).
- Keep-alive connections pool (thanks to http-conduit).
Typical workflow in crawler:
    withDnsCache $ \ c -> withDownloader $ \ d -> do
        ... -- got URL from queue
        ra <- resolveA c $ hostNameFromUrl url
        case ra of
            Left err -> ... -- uh oh, bad host
            Right ha -> do
                ... -- crawler politeness stuff (rate limits, domain queues)
                dr <- download d url (Just ha) opts
                case dr of
                    DROK dat redownloadOpts -> ... -- analyze data, save redownloadOpts for next download
                    DRRedirect .. -> ...
                    DRNotModified -> ...
                    DRError e -> ...
It's highly recommended to use http://hackage.haskell.org/package/hsdns-cache for DNS resolution, since getAddrInfo (used in http-conduit) can be buggy and inefficient when it needs to resolve many hosts per second for a long time.
- urlGetContents :: String -> IO ByteString
- urlGetContentsPost :: String -> ByteString -> IO ByteString
- download :: Downloader -> String -> Maybe HostAddress -> DownloadOptions -> IO DownloadResult
- post :: Downloader -> String -> Maybe HostAddress -> ByteString -> IO DownloadResult
- downloadG :: (Request (ResourceT IO) -> ResourceT IO (Request (ResourceT IO))) -> Downloader -> String -> Maybe HostAddress -> DownloadOptions -> IO DownloadResult
- data DownloadResult
- type DownloadOptions = [String]
- data DownloaderSettings = DownloaderSettings {}
- data Downloader
- withDownloader :: (Downloader -> IO a) -> IO a
- withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a
- newDownloader :: DownloaderSettings -> IO Downloader
- postRequest :: ByteString -> Request a -> Request b
- sinkByteString :: MonadIO m => Int -> Sink ByteString m (Maybe ByteString)
Download operations
urlGetContents :: String -> IO ByteString
Download a single URL with the default DownloaderSettings. Fails if the result is not DROK.
urlGetContentsPost :: String -> ByteString -> IO ByteString
Post data and download a single URL with the default DownloaderSettings. Fails if the result is not DROK.
download
  :: Downloader
  -> String -- URL
  -> Maybe HostAddress -- Optional resolved host address
  -> DownloadOptions
  -> IO DownloadResult
Perform download.
post :: Downloader -> String -> Maybe HostAddress -> ByteString -> IO DownloadResult
Perform HTTP POST.
downloadG
  :: (Request (ResourceT IO) -> ResourceT IO (Request (ResourceT IO))) -- Function to modify Request
  -> Downloader
  -> String -- URL
  -> Maybe HostAddress -- Optional resolved host address
  -> DownloadOptions
  -> IO DownloadResult
Generic version of download with ability to modify http-conduit Request.
data DownloadResult
Result of download operation.
DROK ByteString DownloadOptions | Successful download with data and options for next download. |
DRRedirect String | Redirect URL |
DRError String | Error |
DRNotModified | HTTP 304 Not Modified |
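The four constructors map naturally onto four crawler-side decisions. The following is only a sketch of that mapping: DownloadResult and DownloadOptions are redeclared locally as stand-ins mirroring the constructors listed above so that the example compiles on its own, and the Action type and nextAction function are hypothetical names that are not part of this module.

```haskell
import qualified Data.ByteString.Char8 as B

-- Local stand-ins mirroring the documented types, so this sketch
-- compiles without the downloader package installed.
type DownloadOptions = [String]

data DownloadResult
    = DROK B.ByteString DownloadOptions
    | DRRedirect String
    | DRError String
    | DRNotModified

-- Hypothetical crawler-side decision taken for each result.
data Action
    = Store B.ByteString DownloadOptions -- keep the body and the options for the next download
    | Enqueue String                     -- put the redirect target back on the queue
    | KeepCached                         -- 304: the stored copy is still current
    | Report String                      -- log the human-readable error message
    deriving (Eq, Show)

-- Total function over all four constructors.
nextAction :: DownloadResult -> Action
nextAction (DROK dat opts)  = Store dat opts
nextAction (DRRedirect url) = Enqueue url
nextAction DRNotModified    = KeepCached
nextAction (DRError e)      = Report e
```

Keeping the match total means the compiler can warn (with -Wall) if a result case is ever left unhandled.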
type DownloadOptions = [String]
If-None-Match and/or If-Modified-Since headers.
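Since the options returned with DROK are meant to be passed to the next download of the same URL, a crawler needs somewhere to keep them. One possible sketch, assuming an in-memory map keyed by URL, is shown here; OptionsCache, optionsFor and remember are hypothetical helper names, and the option strings are treated as opaque values rather than built by hand.

```haskell
import qualified Data.Map.Strict as M

-- Same shape as the documented type; redeclared so the sketch stands alone.
type DownloadOptions = [String]

-- Hypothetical per-URL store of the options returned with the last DROK.
type OptionsCache = M.Map String DownloadOptions

-- Options to pass to the next download of a URL; empty on a first visit.
optionsFor :: String -> OptionsCache -> DownloadOptions
optionsFor = M.findWithDefault []

-- Remember the options returned by a successful download of a URL,
-- replacing any options stored earlier for it.
remember :: String -> DownloadOptions -> OptionsCache -> OptionsCache
remember = M.insert
```

In a long-running crawler the same idea would more likely live in a persistent store, but the lookup-then-update cycle stays the same.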
Downloader
data DownloaderSettings
Settings used in downloader.
DownloaderSettings | |
data Downloader
Keeps http-conduit Manager and DownloaderSettings.
withDownloader :: (Downloader -> IO a) -> IO a
Create a new Downloader, use it in the provided function, and then release it.
withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a
Create a new Downloader with provided settings, use it in the provided function, and then release it.
newDownloader :: DownloaderSettings -> IO Downloader
Create a Downloader with settings.
Utils
postRequest :: ByteString -> Request a -> Request b
Make HTTP POST request.
sinkByteString :: MonadIO m => Int -> Sink ByteString m (Maybe ByteString)
Sink data using 32k buffers to reduce memory fragmentation. Returns Nothing if too much data is downloaded.
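The size-limit behaviour can be illustrated without conduit. The sketch below is not the sink's implementation: it works on a plain list of chunks, does no 32k buffering, and assumes that "too much" means the running total strictly exceeds the limit. limitedConcat is a hypothetical name used only for this illustration.

```haskell
import qualified Data.ByteString.Char8 as B

-- Concatenate chunks in order, giving up with Nothing as soon as the
-- running total exceeds the limit (assumed semantics, see the note above).
limitedConcat :: Int -> [B.ByteString] -> Maybe B.ByteString
limitedConcat limit = go 0 []
  where
    -- All chunks consumed within the limit: join them in arrival order.
    go _ acc [] = Just (B.concat (reverse acc))
    go n acc (c : cs)
        | n' > limit = Nothing               -- stop early, later chunks are never inspected
        | otherwise  = go n' (c : acc) cs
      where
        n' = n + B.length c
```

Stopping at the first chunk that crosses the limit is what keeps an oversized response from being held in memory in full.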