Provides information on ways to automate online tasks using webbots and spiders, covering such topics as parsing data from Web pages, managing cookies, sending and receiving email, and decoding encrypted files.
Kevin Hemenway, Tara Calishain
The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form from the one it's presented in, or wanted to collect data from several sites and see it side by side without the constraints of a browser, then Spidering Hacks is for you. Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.
Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content.
By the time you finish Spidering Hacks, you'll be able to:
- Aggregate and associate data from disparate locations, then store and manipulate the data as you like
- Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
- Integrate third-party data into your own applications or web sites
- Make your own site easier to scrape and more usable to others
- Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day
Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.
The content and services available on the web continue to be accessed mostly through direct human control. But this is changing. Increasingly, users rely on automated agents that save them time and effort by programmatically retrieving content, performing complex interactions, and aggregating data from diverse sources. Programming Spiders, Bots, and Aggregators in Java teaches you how to build and deploy a wide variety of these agents--from single-purpose bots to exploratory spiders to aggregators that present a unified view of information from multiple user accounts. You will build on your basic knowledge of Java to quickly master the techniques that are essential to this specialized world of programming, including parsing HTML, interpreting data, working with cookies, reading and writing XML, and managing high-volume workloads. You'll also learn about the ethical issues associated with bot use--and the limitations imposed by some websites. This book offers two levels of instruction, both of which are focused on the library of routines provided on the companion CD. If your main concern is adding ready-made functionality to an application, you'll achieve your goals quickly thanks to step-by-step instructions and sample programs that illustrate effective implementations. If you're interested in the technologies underlying these routines, you'll find in-depth explanations of how they work and the techniques required for customization.