Spidering Hacks [Paperback]

3.69/5 (104 ratings by Goodreads)
  • Format: Paperback / softback, 420 pages, height x width x depth: 233x153x25 mm, index
  • Series: Hacks
  • Publication date: 02-Dec-2003
  • Publisher: O'Reilly Media
  • ISBN-10: 0596005776
  • ISBN-13: 9780596005771
  • Paperback
  • Price: 28.81 €*
  • * this is the final price, i.e., no additional discounts are applied
  • List price: 33.90 €
  • You save 15%
  • Delivery takes 3-4 weeks if the book is in stock at the publisher's warehouse. If the publisher needs to print a new run, delivery may be delayed.
Provides techniques for creating spiders and scrapers that retrieve information from web sites and data sources.

With this crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when one has gone too far: what's acceptable and unacceptable), readers learn how to collect media files and data from databases, how to interpret the data and repurpose it for use in other applications, and even how to build authorized interfaces that integrate the data into their own content.

The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.

Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:

  • Aggregate and associate data from disparate locations, then store and manipulate the data as you like
  • Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
  • Integrate third-party data into your own applications or web sites
  • Make your own site easier to scrape and more usable to others
  • Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day
Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.
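
As a taste of the toolkit the book teaches, here is a minimal fetch-and-scrape sketch in the spirit of its LWP::Simple coverage. This is not code from the book: the target URL is a placeholder and the title-grabbing regex is only illustrative.

    #!/usr/bin/perl
    # Fetch a page with LWP::Simple and pull out its <title> tag.
    # A minimal sketch with a placeholder URL; not code from the book.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $url  = 'http://www.example.com/';   # hypothetical target site
    my $html = get($url);
    die "Couldn't fetch $url\n" unless defined $html;

    # Crude scrape: grab the page title straight from the raw HTML.
    my ($title) = $html =~ m{<title>(.*?)</title>}is;
    print "Title: ", (defined $title ? $title : 'none found'), "\n";

Real scrapers would reach for a proper parser (HTML::TreeBuilder, HTML::TokeParser) rather than a regex; the book devotes separate hacks to each.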


Credits ix
Preface xv
Chapter 1. Walking Softly 1
  A Crash Course in Spidering and Scraping 1
  Best Practices for You and Your Spider 3
  Anatomy of an HTML Page 7
  Registering Your Spider 10
  Preempting Discovery 12
  Keeping Your Spider Out of Sticky Situations 15
  Finding the Patterns of Identifiers 18
Chapter 2. Assembling a Toolbox 21
  Perl Modules 22
  Resources You May Find Helpful 23
  Installing Perl Modules 24
  Simply Fetching with LWP::Simple 27
  More Involved Requests with LWP::UserAgent 29
  Adding HTTP Headers to Your Request 30
  Posting Form Data with LWP 32
  Authentication, Cookies, and Proxies 34
  Handling Relative and Absolute URLs 38
  Secured Access and Browser Attributes 40
  Respecting Your Scrapee's Bandwidth 42
  Respecting robots.txt 46
  Adding Progress Bars to Your Scripts 47
  Scraping with HTML::TreeBuilder 53
  Parsing with HTML::TokeParser 56
  WWW::Mechanize 101 59
  Scraping with WWW::Mechanize 62
  In Praise of Regular Expressions 67
  Painless RSS with Template::Extract 70
  A Quick Introduction to XPath 74
  Downloading with curl and wget 78
  More Advanced wget Techniques 80
  Using Pipes to Chain Commands 82
  Running Multiple Utilities at Once 86
  Utilizing the Web Scraping Proxy 89
  Being Warned When Things Go Wrong 93
  Being Adaptive to Site Redesigns 96
Chapter 3. Collecting Media Files 99
  Detective Case Study: Newgrounds 99
  Detective Case Study: iFilm 105
  Downloading Movies from the Library of Congress 108
  Downloading Images from Webshots 111
  Downloading Comics with dailystrips 115
  Archiving Your Favorite Webcams 118
  News Wallpaper for Your Site 122
  Saving Only POP3 Email Attachments 125
  Downloading MP3s from a Playlist 132
  Downloading from Usenet with nget 137
Chapter 4. Gleaning Data from Databases 141
  Archiving Yahoo! Groups Messages with yahoo2mbox 141
  Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups 143
  Gleaning Buzz from Yahoo! 147
  Spidering the Yahoo! Catalog 150
  Tracking Additions to Yahoo! 157
  Scattersearch with Yahoo! and Google 160
  Yahoo! Directory Mindshare in Google 164
  Weblog-Free Google Results 168
  Spidering, Google, and Multiple Domains 171
  Scraping Amazon.com Product Reviews 176
  Receive an Email Alert for Newly Added Amazon.com Reviews 178
  Scraping Amazon.com Customer Advice 180
  Publishing Amazon.com Associates Statistics 182
  Sorting Amazon.com Recommendations by Rating 185
  Related Amazon.com Products with Alexa 188
  Scraping Alexa's Competitive Data with Java 193
  Finding Album Information with FreeDB and Amazon.com 194
  Expanding Your Musical Tastes 203
  Saving Daily Horoscopes to Your iPod 207
  Graphing Data with RRDTOOL 209
  Stocking Up on Financial Quotes 213
  Super Author Searching 217
  Mapping O'Reilly Best Sellers to Library Popularity 232
  Using All Consuming to Get Book Lists 235
  Tracking Packages with FedEx 241
  Checking Blogs for New Comments 243
  Aggregating RSS and Posting Changes 248
  Using the Link Cosmos of Technorati 255
  Finding Related RSS Feeds 259
  Automatically Finding Blogs of Interest 270
  Scraping TV Listings 273
  What's Your Visitor's Weather Like? 277
  Trendspotting with Geotargeting 281
  Getting the Best Travel Route by Train 287
  Geographic Distance and Back Again 290
  Super Word Lookup 296
  Word Associations with Lexical Freenet 300
  Reformatting Bugtraq Reports 303
  Keeping Tabs on the Web via Email 308
  Publish IE's Favorites to Your Web Site 314
  Spidering GameStop.com Game Prices 322
  Bargain Hunting with PHP 325
  Aggregating Multiple Search Engine Results 331
  Robot Karaoke 335
  Searching the Better Business Bureau 339
  Searching for Health Inspections 342
  Filtering for the Naughties 345
Chapter 5. Maintaining Your Collections 349
  Using cron to Automate Tasks 349
  Scheduling Tasks Without cron 351
  Mirroring Web Sites with wget and rsync 355
  Accumulating Search Results Over Time 359
Chapter 6. Giving Back to the World 363
  Using XML::RSS to Repurpose Data 364
  Placing RSS Headlines on Your Site 368
  Making Your Resources Scrapable with Regular Expressions 371
  Making Your Resources Scrapable with a REST Interface 378
  Making Your Resources Scrapable with XML-RPC 381
  Creating an IM Interface 385
  Going Beyond the Book 389
Index 391
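
To give a flavor of the toolbox chapter above, here is a short sketch of a polite link-following spider built on WWW::Mechanize. The starting URL, user-agent string, and link pattern are illustrative assumptions, not examples from the book.

    #!/usr/bin/perl
    # Follow a link with WWW::Mechanize and list the URLs on the target page.
    # A minimal sketch with a hypothetical URL and link text; not from the book.
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new(
        agent     => 'MyPoliteSpider/0.1 (you@example.com)',  # identify your spider
        autocheck => 1,                                       # die on HTTP errors
    );

    $mech->get('http://www.example.com/');   # hypothetical starting point
    sleep 2;                                 # go easy on the scrapee's bandwidth

    # Follow a link by its visible text, then print each absolute URL it links to.
    $mech->follow_link(text_regex => qr/archive/i);
    print $_->url_abs, "\n" for $mech->links;

For crawling that honors robots.txt automatically, LWP::RobotUA (a subclass of LWP::UserAgent) is the usual choice; the "Respecting robots.txt" hack listed above addresses exactly this concern.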
Kevin Hemenway, coauthor of Mac OS X Hacks, is better known as Morbus Iff, the creator of disobey.com, which bills itself as "content for the discontented." Publisher and developer of more home cooking than you could ever imagine, he'd love to give you a Fry Pan of Intellect upside the head. Politely, of course. And with love.

Tara Calishain is the creator of the site ResearchBuzz. She is an expert on Internet search engines and how they can be used effectively in business situations.