# Scrapy data flow
The data flow in Scrapy is controlled by the execution engine, and goes like this:
- The Engine gets the initial Requests to crawl from the Spider.
- The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
- The Scheduler returns the next Requests to the Engine.
- The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see `process_request()`).
- Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see `process_response()`).
- The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see `process_spider_input()`).
- The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see `process_spider_output()`).
- The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
- The process repeats (from step 3, where the Scheduler returns the next Requests) until there are no more Requests from the Scheduler.
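
The loop described above can be sketched as a toy model in plain Python. This is not Scrapy's implementation and does not import Scrapy. The classes `Request`, `Response`, `Scheduler`, `Downloader`, `Spider`, the `FAKE_WEB` dictionary, and the `run_engine` function are all hypothetical stand-ins invented for illustration. The middleware hooks are reduced to log lines that only mark where each hook would sit in the flow, and the step numbers in the comments count the bullets above starting from 1.

```python
from collections import deque


class Request:
    def __init__(self, url):
        self.url = url


class Response:
    def __init__(self, url, body):
        self.url = url
        self.body = body


# Hypothetical in-memory "web": url -> (page body, outgoing links).
FAKE_WEB = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["c"]),
    "c": ("page C", []),
}


class Scheduler:
    """FIFO queue of pending requests; skips URLs it has already seen
    (a simplification so the toy crawl visits each page once)."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, request):
        if request.url not in self.seen:
            self.seen.add(request.url)
            self.queue.append(request)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


class Downloader:
    def fetch(self, request):
        body, _links = FAKE_WEB[request.url]
        return Response(request.url, body)


class Spider:
    start_urls = ["a"]

    def start_requests(self):
        return [Request(url) for url in self.start_urls]

    def parse(self, response):
        # Yield one scraped item, then one new Request per link to follow.
        yield {"url": response.url, "body": response.body}
        for link in FAKE_WEB[response.url][1]:
            yield Request(link)


def run_engine(spider, log):
    scheduler, downloader, items = Scheduler(), Downloader(), []
    # Steps 1-2: take the initial Requests from the Spider and schedule them.
    for request in spider.start_requests():
        scheduler.enqueue(request)
    # Steps 3 and 9: keep asking the Scheduler until it has nothing left.
    while (request := scheduler.next_request()) is not None:
        log.append(f"process_request {request.url}")          # step 4
        response = downloader.fetch(request)
        log.append(f"process_response {response.url}")        # step 5
        log.append(f"process_spider_input {response.url}")    # step 6
        for result in spider.parse(response):                 # step 7
            log.append(f"process_spider_output {response.url}")
            # Step 8: items go to the pipeline, Requests back to the Scheduler.
            if isinstance(result, Request):
                scheduler.enqueue(result)
            else:
                items.append(result)
    return items


if __name__ == "__main__":
    hook_log = []
    for item in run_engine(Spider(), hook_log):
        print(item)
```

In this sketch the Engine is just the `while` loop: it owns no data of its own and only moves Requests, Responses, and items between the other components, which is the role the list above gives it. The real Engine works asynchronously and handles many Requests concurrently, which this sequential model leaves out.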
# References
# position
https://docs.scrapy.org/en/latest/topics/architecture.html#data-flow