```
$ scrapy crawl quotes
```
This command runs the spider with name quotes that we've just added; it sends some requests to the quotes.toscrape.com domain. You will get an output similar to this:
```
... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...
```
Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method), passing the response as an argument.
The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.
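For reference, the spider that produced the log above looks roughly like this. This is a sketch following the official tutorial's quotes spider; the callback is passed explicitly only to make the request/callback pairing visible:
```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            # parse() would be used as the default callback anyway;
            # passing it here just makes the pairing explicit.
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save each downloaded page as quotes-1.html, quotes-2.html, ...
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
```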
```
>>> response.xpath('//title')
[<Selector xpath='//title' data=u'<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
u'Quotes to Scrape'
```
XPath expressions are very powerful, and they are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under the hood. You can see that if you read the text representation of the selector objects in the shell closely.
Here the point is that XPath is the foundation of CSS selectors, which is to say XPath is the more powerful of the two.
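You can check this in the shell yourself: the repr of a Selector returned by response.css() shows the XPath it was translated to. The exact output below is from a run with an older Scrapy release and may vary by version:
```
>>> response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
```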
Now, on to extracting the target data we actually want.
The goal in this tutorial is to scrape the quote and author data from the site; my first guess was regular expressions 🙄.
The HTML we fetch from the site is formatted like this:
```
<div class="quote">
    <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
```
Here we open the scrapy shell again:
```
$ scrapy shell 'http://quotes.toscrape.com'
# Use a CSS selector to get all the div.quote selector objects
>>> response.css("div.quote")
# quote is now the first of those selector objects
>>> quote = response.css("div.quote")[0]
>>> quote
<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>
# Then query the elements nested inside quote
>>> title = quote.css("span.text::text").extract_first()
>>> title
u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
[u'change', u'deep-thoughts', u'thinking', u'world']
```
```
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
...
{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'], 'author': u'Albert Einstein'}
{'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d', 'tags': [u'abilities', u'choices'], 'author': u'J.K. Rowling'}
... a few more of these, omitted for brevity
```
Extracting data in the Spider
```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes_extract"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
```
Run the spider and check the output; it looks roughly like this:
```
$ scrapy crawl quotes_extract
...
2017-04-08 22:00:06 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'], 'author': u'Albert Einstein'}
2017-04-08 22:00:06 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d', 'tags': [u'abilities', u'choices'], 'author': u'J.K. Rowling'}
...
# Export to a JSON file
$ scrapy crawl quotes_extract -o quotes_extract.json
# Export to a JSON Lines file
$ scrapy crawl quotes_extract -o quotes_extract.jl
```
The JSON Lines format is useful because it's stream-like: you can easily append new records to it, so it doesn't have the problem JSON has when you run the crawl twice (appending to an existing JSON file produces invalid JSON). Also, since each record is on a separate line, you can process big files without having to fit everything in memory; there are tools like jq to help do that at the command line.
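For instance, here is a minimal sketch of streaming the .jl export record by record in Python, assuming the quotes_extract.jl file produced above:
```python
import json

# Each line of a JSON Lines file is one complete JSON document,
# so we can process records one at a time without loading the whole file.
with open('quotes_extract.jl') as f:
    for line in f:
        record = json.loads(line)
        print(record['author'], '-', len(record['tags']), 'tags')
```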
The terminal output here is the same as above; the difference is that the data is also saved into the JSON file. A few records are listed below to show the result:
[ {"text":"\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d","author":"Albert Einstein","tags":["change","deep-thoughts","thinking","world"]}, {"text":"\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d","author":"J.K. Rowling","tags":["abilities","choices"]}, {"text":"\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d","author":"Albert Einstein","tags":["inspirational","life","live","miracle","miracles"]}, {"text":"\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d","author":"Jane Austen","tags":["aliteracy","books","classic","humor"]}, {"text":"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d","author":"Marilyn Monroe","tags":["be-yourself","inspirational"]}, {"text":"\u201cTry not to become a man of success. Rather become a man of value.\u201d","author":"Albert Einstein","tags":["adulthood","success","value"]}, ... ]
In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex operations on the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines was set up for you when the project was created, in tutorial/pipelines.py. You don't need to implement any item pipelines, though, if you just want to store the scraped items.
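As an illustration, a minimal pipeline might look like the sketch below. The class name RequireAuthorPipeline and its drop rule are made up for this example, not part of the tutorial:
```python
# tutorial/pipelines.py
from scrapy.exceptions import DropItem

class RequireAuthorPipeline(object):  # hypothetical example pipeline
    def process_item(self, item, spider):
        # Drop any scraped quote that is missing its author field.
        if not item.get('author'):
            raise DropItem('Missing author in %s' % item)
        return item
```
To activate it, you would list it in settings.py, e.g. ITEM_PIPELINES = {'tutorial.pipelines.RequireAuthorPipeline': 300}, where the number controls the order pipelines run in.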
What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.
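In our quotes spider, following the pagination links would look roughly like this. This is a sketch based on the official tutorial; the li.next selector matches the "Next" button on quotes.toscrape.com:
```python
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
        # Find the "Next" page link and schedule a request for it,
        # reusing parse() as the callback.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```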
Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.
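If you ever do want to revisit a URL, you can also bypass the filter for a single request; a one-line sketch, not from the tutorial:
```python
# dont_filter=True tells Scrapy's dupefilter to let this request through
# even if the URL has already been seen.
yield scrapy.Request(url, callback=self.parse, dont_filter=True)
```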