Top 13 features you need to know about
Nokogiri is a pretty simple and straightforward gem. In the coming few pages we will take a more in-depth look at the most important methods, along with a few useful Ruby methods to take your parsing skills to the next level.
The css method
css(rules) —> NodeSet
The css method searches self for nodes matching CSS rules and returns an iterable NodeSet. Self is a Nokogiri::HTML::Document. In contrast to the at_css method used in the quick start project, the css method returns all nodes that match the CSS rules; the at_css method only returns the first node.
The following is a css method example:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get all the h3 headings
doc.css('h3')

# get all the paragraphs
doc.css('p')

# get all the unordered lists
doc.css('ul')

# get all the section/category list items
doc.css('.navigationHomeLede li')
There is no explicit output from this code.
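If you want to confirm that your selectors matched anything, you can inspect the returned object. A minimal sketch (class and first are standard Ruby/Nokogiri calls):

# confirm that css returns a NodeSet and peek at the first match
puts doc.css('h3').class  # Nokogiri::XML::NodeSet
puts doc.css('h3').first  # prints the first matching h3 element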
For more information refer to the site:
The length method
length —> int
The length method returns the number of objects in self. Self is an array. length is one of the standard methods included with Ruby and is not Nokogiri-specific. It is very useful when playing with Nokogiri NodeSets, which behave like arrays and support most of the same methods, meaning you can call length on them. For example, you can use length to see how many nodes match your CSS rule when using the css method.
An example of the length method is as follows:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get all the h3 headings
h3_count = doc.css('h3').length
puts "h3 count #{h3_count}"

# get all the paragraphs
p_count = doc.css('p').length
puts "p count #{p_count}"

# get all the unordered lists
ul_count = doc.css('ul').length
puts "ul count #{ul_count}"

# get all the section/category list items
# size is an alias for length and may be used interchangeably
section_count = doc.css('.navigationHomeLede li').size
puts "section count #{section_count}"

Run the above code to see the counts:

$ ruby length.rb
h3 count 7
p count 47
ul count 64
section count 13
Your counts will be different as this code is running against the live New York Times website.
I use this method most in the IRB shell during the exploration phase. Once you know how large the array is, you can also access individual nodes using the standard array selector:
> doc.css('h3')[2]
=> #<Nokogiri::XML::Element:0x3fd31ec696f8 name="h3" children=[#<Nokogiri::XML::Element:0x3fd31ec693b0 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fd31ec692c0 name="href" value="http://www.nytimes.com/2013/06/25/world/europe/snowden-case-carries-a-cold-war-aftertaste.html?hp">] children=[#<Nokogiri::XML::Text:0x3fd31ec68924 "\nSnowden Case Has Cold War Aftertaste">]>]>
For more information refer to the site:
The each method 1
each { |item| block } —> ary
The each method calls the block once for each element in self. Self is a Ruby enumerable object. This method is part of the Ruby standard library and is not specific to Nokogiri.
The each method is useful to iterate over Nokogiri NodeSets:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# iterate through the h3 headings
doc.css('h3').each{ |h3| puts h3 }
Run the preceding code to see the following iteration:
$ ruby each.rb
<h3><a href="http://dealbook.nytimes.com/2013/06/24/u-s-civil-charges-against-corzine-are-seen-as-near/?hp"> Regulators Are Said to Plan a Civil Suit Against Corzine</a></h3>
<h3><a href="http://www.nytimes.com/2013/06/25/business/global/credit-warnings-give-world-a-peek-into-chinas-secretive-banks.html?hp"> Credit Warnings Expose China's Secretive Banks</a></h3>
…
Your output will differ as this is run against the live New York Times website. For more information refer to the site:
The each method 2
each { |key,value| block } —> ary
There is also a Nokogiri-native each method, which is called on a single node to iterate over that node's attribute name/value pairs. This isn't particularly useful, but we will take a look at an example to help avoid confusion.
The example is as follows:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# iterate through key/value pairs of an individual node
# as we know, the css method returns an enumerable object
# so we can access a specific node using standard array syntax
doc.css('a')[4].each{ |node_name, node_value|
  puts "#{node_name}: #{node_value}"
}
This shows us the available attributes for the fifth link on the page:
$ ruby nokogiri_each.rb
style: display: none;
id: clickThru4Nyt4bar1_xwing2
href: http://www.nytimes.com/adx/bin/adx_click.html?type=goto&opzn&page=homepage.nytimes.com/index.html&pos=Bar1&sn2=5b35bc29/49f095e7&sn1=ab317851/c628eac9&camp=nyt2013_abTest_multiTest_anchoredAd_bar1_part2&ad=bar1_abTest_hover&goto=https%3A%2F%2Fwww%2Enytimesathome%2Ecom%2Fhd%2F205%3Fadxc%3D218268%26adxa%3D340400%26page%3Dhomepage.nytimes.com/index.html%26pos%3DBar1%26campaignId%3D3JW4F%26MediaCode%3DWB7AA
Your output will differ as this is run against the live New York Times website.
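As an aside, if you only want a single attribute rather than all of them, Nokogiri nodes also support the square-bracket accessor, so there is no need to iterate:

# read just the href attribute of the fifth link
puts doc.css('a')[4]['href']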
For more information refer to the site:
The content method
content —> string
The content method returns the text content of a node. This is how you parse the content matched by a CSS selector. If you used the css method and have a NodeSet, you will need to iterate with the each method to extract the content of each node.
The example for the content method is as follows:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# iterate through the h3 headings
doc.css('h3').each{ |h3|
  # extract the content from the h3
  puts h3.content
}
Run the preceding code to see the content of the h3 tags:
$ ruby content.rb
Regulators Are Said to Plan a Civil Suit Against Corzine
Credit Warnings Expose China's Secretive Banks
Affirmative Action Case Has Both Sides Claiming Victory
Back in the News, but Never Off U.S. Radar
When Exercise Becomes an Addiction
Lee Bollinger: A Long, Slow Drift From Racial Justice
Your output will differ as this is run against the live New York Times website.
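A related convenience worth knowing: NodeSet also responds to text, which returns the content of every node in the set concatenated into one string. A quick sketch:

# all the h3 text on the page in a single string
puts doc.css('h3').text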
For more information refer to the site:
The at_css method
at_css(rules) —> node
The at_css method searches the document and returns the first node matching the CSS selector. This is useful when you know there is only one match in the DOM, or when the first match is all you need. Because it is able to stop at the first match, at_css is faster than the naked css method. Additionally, you don't have to iterate over the object to access its properties.
The example is as follows:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get the content of the title of the page
# because there is only one title, we can use at_css
puts doc.at_css('title').content
Run the preceding code to parse the title:
$ ruby at_css.rb
The New York Times - Breaking News, World News & Multimedia
Your output will likely be the same, because The New York Times rarely changes its title tag, though it is possible they have updated it.
For more information refer to the site:
The xpath method
xpath(paths) —> NodeSet
The xpath method searches self for nodes matching XPath rules and returns an iterable NodeSet. Self is a Nokogiri::HTML::Document or Nokogiri::XML::Document. The xpath method returns all nodes that match the XPath rules; the at_xpath method only returns the first node.
An example use of the xpath method is as follows:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get all the h3 headings
h3_count = doc.xpath('//h3').length
puts "h3 count #{h3_count}"

# get all the paragraphs
p_count = doc.xpath('//p').length
puts "p count #{p_count}"

# get all the unordered lists
ul_count = doc.xpath('//ul').length
puts "ul count #{ul_count}"

# get all the section/category list items
# note this rule is substantially different from the CSS.
# *[@class="navigationHomeLede"] says to find any node
# with the class attribute = navigationHomeLede. We then
# have to explicitly search for an unordered list before
# searching for list elements.
section_count = doc.xpath('//*[@class="navigationHomeLede"]/ul/li').size
puts "section count #{section_count}"
Run the preceding code to see the counts:
$ ruby xpath.rb
h3 count 7
p count 47
ul count 64
section count 13
Your counts will differ as this code is running against the live New York Times website, but they should be consistent with the counts from the css method.
For more information refer to the site:
The at_xpath method
at_xpath(paths) —> node
The at_xpath method searches the document and returns the first node matching the XPath selector. This is useful when you know there is only one match in the DOM, or when the first match is all you need. Because it is able to stop at the first match, at_xpath is faster than the naked xpath method. Additionally, you don't have to iterate over the object to access its properties.
An example for the at_xpath method is as follows:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get the content of the title of the page
# because there is only one title, we can use at_xpath
puts doc.at_xpath('//title').content
Run the preceding code to parse the title:
$ ruby at_xpath.rb
The New York Times - Breaking News, World News & Multimedia
Your output will likely be the same, because The New York Times rarely changes its title tag, though it is possible they have updated it. Either way, your output should match what you got from at_css.
For more information refer to the site:
http://nokogiri.org/Nokogiri/XML/Node.html#method-i-at_xpath
The to_s method
to_s —> string
The to_s method turns self into a string. If self is an HTML document, to_s returns HTML. If self is an XML document, to_s returns XML. This is useful in an IRB session where you want to examine the source of a node to determine how to craft your selector, or when you need the raw HTML for your project.
An example of to_s is as follows:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get the HTML for the top story link
# if you remember from the quick start, there is only one
# of these on the page, so we can use at_css to target it
puts doc.at_css('h2 a').to_s
Run the preceding code to see the HTML:
$ ruby to_s.rb
<a href="http://www.nytimes.com/2013/06/26/us/politics/obama-plan-to-cut-greenhouse-gases.html?hp">President to Outline Plan on Greenhouse Gas Emissions</a>
Your output will differ as this is run against the live New York Times website, but you should be able to confirm this is indeed the top headline by visiting http://www.nytimes.com in your browser.
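If you want to be explicit that you expect HTML output, Nokogiri nodes also provide a to_html method, which for an HTML document behaves like to_s. A one-line sketch:

# explicitly serialize the node as HTML
puts doc.at_css('h2 a').to_html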
For more information refer to the site:
http://nokogiri.org/Nokogiri/XML/Node.html#method-i-to_s
This concludes the base methods you will need to interact with Nokogiri for your scraping and parsing projects. You now know how to target specific content with CSS or XPath selectors, iterate through NodeSets, and extract their content. Next, we will go over a few tips and tricks that will help you should you get into a bind with your Nokogiri project.
Spoofing browser agents
When you request a web page, you send meta information along with your request in the form of headers. One of these headers, User-Agent, tells the web server which web browser you are using. By default, open-uri, the library we are using to scrape, will report your browser as Ruby.
There are two issues with this. First, it makes it very easy for an administrator to look through their server logs and see if someone has been scraping the server. Ruby is not a standard web browser. Second, some web servers will deny requests that are made by a nonstandard browsing agent.
We are going to spoof our browser agent so that the server thinks we are just another Mac using Safari.
An example is as follows:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# this string is the browser agent for Safari running on a Mac
browser = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1'

# create a new Nokogiri HTML document from the scraped URL, passing
# the browser agent to open-uri as the User-Agent header option
doc = Nokogiri::HTML(open('http://nytimes.com', 'User-Agent' => browser))

# you can now go along with your request as normal
# you will show up as just another safari user in the logs
puts doc.at_css('h2 a').to_s
Caching
It's important to remember that every time we scrape content, we are using someone else's server resources. While it is true that we are not using any more resources than a standard web browser request, the automated nature of our requests leaves the potential for abuse.
In the previous examples we have searched for the top headline on The New York Times website. What if we took this code and put it in a loop because we always want to know the latest top headline? The code would work, but we would be launching a mini denial of service (DoS) attack on the server by hitting their page potentially thousands of times every minute.
Many servers, Google being one example, have automatic blocking set up to prevent these rapid requests. They ban IP addresses that access their resources too quickly. This is known as rate limiting.
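Caching, covered next, is one defense. A simpler courtesy measure is to pause between requests with Ruby's sleep. The following sketch checks the headline three times with a 10-second pause; both numbers are arbitrary choices for illustration, not values required by any site:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# check the headline a few times, pausing between requests
3.times do
  doc = Nokogiri::HTML(open('http://nytimes.com'))
  puts doc.at_css('h2 a').content
  # wait 10 seconds before making the next request
  sleep 10
end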
To avoid being rate limited and in general be a good netizen, we need to implement a caching layer. Traditionally in a large app this would be implemented with a database. That's a little out of scope for this book, so we're going to build our own caching layer with a simple TXT file. We will store the headline in the file and then check the file modification date to see if enough time has passed before checking for new headlines.
Start by creating the cache.txt file in the same directory as your code:
$ touch cache.txt
We're now ready to craft our caching solution:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# set how long in minutes until our data is expired
# multiplied by 60 to convert to seconds
expiration = 1 * 60

# file to store our cache in
cache = "cache.txt"

# Calculate how old our cache is by subtracting its modification time
# from the current time.
# Time.new gets the current time
# The mtime method gets the modification time on a file
cache_age = Time.new - File.new(cache).mtime

# if the cache age is greater than our expiration time
if cache_age > expiration
  # our cache has expired
  puts "cache has expired. fetching new headline"

  # we will now use our code from the quick start to
  # snag a new headline

  # scrape the web page
  data = open('http://nytimes.com')

  # create a Nokogiri HTML Document from our data
  doc = Nokogiri::HTML(data)

  # parse the top headline and clean it up
  headline = doc.at_css('h2 a').content.gsub(/\n/," ").strip

  # we now need to save our new headline
  # the second File.open parameter "w" tells Ruby to overwrite
  # the old file
  File.open(cache, "w") do |file|
    # we then simply puts our text into the file
    file.puts headline
  end

  puts "cache updated"
else
  # we should use our cached copy
  puts "using cached copy"

  # read the cache into a string using the read method
  headline = IO.read(cache)
end

puts "The top headline on The New York Times is ..."
puts headline
Our cache is set to expire in one minute, so assuming it has been at least one minute since you created your cache.txt file, let's fire up our Ruby script:
$ ruby cache.rb
cache has expired. fetching new headline
cache updated
The top headline on The New York Times is ...
Supreme Court Invalidates Key Part of Voting Rights Act
If we run our script again before another minute passes, it should use the cached copy:
$ ruby cache.rb
using cached copy
The top headline on The New York Times is ...
Supreme Court Invalidates Key Part of Voting Rights Act
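One fragility worth noting: if cache.txt does not exist, File.new(cache).mtime raises Errno::ENOENT, which is why we created the file with touch first. One possible guard, if you would rather skip the manual step, is to treat a missing file as an already-expired cache:

# treat a missing cache file as an already-expired cache
if File.exist?(cache)
  cache_age = Time.new - File.new(cache).mtime
else
  cache_age = expiration + 1
end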
SSL
By default, open-uri verifies SSL certificates, so on a system without a usable certificate store, any URL that starts with https will give you an error. We can get around this by adding one line below our require statements:
# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# disable SSL certificate checking to allow scraping
# note: reassigning this constant makes Ruby print an
# "already initialized constant" warning, and it disables
# verification for the entire process
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
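Because reassigning the constant affects every SSL connection in the process, it is worth knowing that open-uri also accepts an :ssl_verify_mode option, which keeps the change local to a single request. A sketch (the https URL here is a stand-in for whatever page you are scraping):

# import nokogiri to parse, open-uri to scrape, openssl for the constant
require 'nokogiri'
require 'open-uri'
require 'openssl'

# disable certificate verification for this request only
doc = Nokogiri::HTML(open('https://example.com',
  ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))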
Mechanize
Sometimes you need to interact with a page before you can scrape it. The most common examples are logging in or submitting a form. Nokogiri is not set up to interact with pages. Nokogiri doesn't even scrape or download the page. That duty falls on open-uri. If you need to interact with a page, there is another gem you will have to use: Mechanize.
Mechanize is created by the same team as Nokogiri and is used for automating interactions with websites. Mechanize includes a functioning copy of Nokogiri.
To get started, install the mechanize gem:
$ gem install mechanize
Successfully installed mechanize-2.7.1
We're going to recreate the code sample from the installation where we parsed the top Google results for "packt", except this time we are going to start by going to the Google home page and submitting the search form:
# mechanize takes the place of Nokogiri and open-uri
require 'mechanize'

# create a new mechanize agent
# think of this as launching your web browser
agent = Mechanize.new

# open a URL in your agent / web browser
page = agent.get('http://google.com/')

# the google homepage has one big search box
# if you inspect the HTML, you will find a form with the name 'f'
# inside of the form you will find a text input with the name 'q'
google_form = page.form('f')

# tell the page to set the q input inside the f form to 'packt'
google_form.q = 'packt'

# submit the form
page = agent.submit(google_form)

# loop through an array of objects matching a CSS
# selector. mechanize uses the search method instead of
# xpath or css. search supports xpath and css
# you can use the search method in Nokogiri too if you
# like it
page.search('h3.r').each do |link|
  # print the link text
  puts link.content
end
Now execute the Ruby script and you should see the titles for the top results:
$ ruby mechanize.rb
Packt Publishing: Home
Books
Latest Books
Login/register
PacktLib
Support
Contact
Packt - Wikipedia, the free encyclopedia
Packt Open Source (PacktOpenSource) on Twitter
Packt Publishing (packtpub) on Twitter
Packt Publishing | LinkedIn
Packt Publishing | Facebook
For more information refer to the site:
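For completeness, here is a sketch of the other common interaction mentioned above: logging in. Everything here is a hypothetical placeholder. The URL, the form name 'login', and the field names username and password are invented for illustration; inspect the real page's HTML to find the actual names:

# mechanize handles fetching, form filling, and submitting
require 'mechanize'

agent = Mechanize.new

# fetch the login page (hypothetical URL)
page = agent.get('http://example.com/login')

# find the form named 'login' and fill in its fields
# (form and field names are placeholders)
login_form = page.form('login')
login_form.username = 'me'
login_form.password = 'secret'

# submit the form; the page returned is the logged-in view
page = agent.submit(login_form)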