Though there are many scraping tools available, one name that keeps coming up is ScrapeBox. ScrapeBox is easy to use, surprisingly powerful, and reasonably priced. With a large community of users it is easy to find great tutorials and assistance. I’ll take you through the basics of how to use ScrapeBox and supply some alternative resources to help you brush up. Our first step will be to harvest some proxies. Let’s get started scraping!
Why Do I Need Proxies?
If Google senses that an IP address is using an automated tool to run searches, it will present a ReCAPTCHA challenge to slow the process. If it sees the script overcome the ReCAPTCHA, more advanced blocking measures, such as banning your IP, may be taken.
Proxies alleviate this by masking your IP behind theirs. With around 500 or more proxies alternating scraping duty, it is unlikely Google will catch on.
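To make the idea of proxies "alternating scraping duty" concrete, here is a minimal Python sketch of round-robin rotation. It is only an illustration of the concept, not how ScrapeBox works internally; the function name `assign_proxies` and the proxy addresses are made up for the example.

```python
from itertools import cycle


def assign_proxies(queries, proxies):
    """Pair each query with the next proxy in round-robin order."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    rotation = cycle(proxies)  # endlessly repeats the proxy list in order
    return [(query, next(rotation)) for query in queries]


# Hypothetical placeholder addresses (203.0.113.0/24 is reserved for documentation).
proxies = ["203.0.113.10:8080", "203.0.113.11:3128", "203.0.113.12:8080"]
queries = ["keyword one", "keyword two", "keyword three", "keyword four"]

for query, proxy in assign_proxies(queries, proxies):
    print(f"{query!r} -> {proxy}")
```

The larger the proxy list, the fewer searches each individual IP has to send, which is why a bigger pool draws less attention.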
How to Use ScrapeBox – Harvesting Proxies
Though it is always recommended to use private proxies, when you’re first learning the ropes of how to use ScrapeBox you may not have made that investment. Luckily ScrapeBox has a feature to harvest public proxies we can use.
Open ScrapeBox and click “Manage” under the “Select Engines & Proxies” pane. This will bring up the proxy manager. Simply click “Harvest Proxies”. This will bring up a popup with several default proxy harvesting locations. Click “Start”. It may take a few moments for the process to complete. When it does, click “Apply” and we will return to the Proxy Manager window.
Once our list is complete it is crucial we use the “Test Proxies” dropdown followed by “Test All Proxies”. Testing the proxies can be a very long process, so you may want to run it overnight. The test ensures each proxy’s IP is not blacklisted by Google.
Once the test has completed click “Filter” and select “Keep Google Proxies”. Finally, select the “Save” dropdown and choose “Save Proxies to a Text File”, then in the “Save” dropdown again choose “Save Proxies to Scrapebox”. You now have a functioning proxy list, the first step in how to use ScrapeBox.
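If you want to sanity check the saved text file outside ScrapeBox, a short script can help. The sketch below assumes the file holds one IPv4 `ip:port` entry per line (an assumption about the export format, so check your own file first). The helper `parse_proxy_lines` is a hypothetical name, and it only checks that entries are well formed and unique. It does not test whether a proxy is alive or blocked by Google.

```python
import ipaddress


def parse_proxy_lines(lines):
    """Return unique, well-formed 'ip:port' entries in their original order."""
    seen = set()
    cleaned = []
    for raw in lines:
        entry = raw.strip()
        if not entry or entry in seen:
            continue  # skip blank lines and duplicates
        host, sep, port = entry.partition(":")
        if not sep or not (port.isascii() and port.isdigit()):
            continue  # no port, or port is not a plain number
        if not 0 < int(port) < 65536:
            continue  # port outside the valid range
        try:
            ipaddress.IPv4Address(host)
        except ValueError:
            continue  # host is not a valid IPv4 address
        seen.add(entry)
        cleaned.append(entry)
    return cleaned


# Hypothetical sample standing in for the lines of a saved proxy file.
sample = ["203.0.113.10:8080", "203.0.113.10:8080", "not-a-proxy", "203.0.113.12:3128"]
print(parse_proxy_lines(sample))
```

To run it on a real export, pass the open file object instead of the sample list, since iterating a file yields its lines.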
Public VS Private Proxies
There are several advantages to using private proxies. Because private proxies are paid for, they see far less traffic and thus are less likely to be flagged by Google. Remember, plenty of other people who know how to use ScrapeBox are harvesting the same public proxies as you!
Paid proxies are often faster than their public brothers, meaning your URL harvesting time will be much shorter. Consider switching to private proxies as soon as it is financially viable for your campaign.
What is a Footprint?
A footprint is a string of characters that appears in many places on the internet. Usually these are generated by PHP scripts or CMS defaults. There are two types of footprints we will be covering, “inurl” and “Additive Words”. All footprints should be stored in a text file so they are easy to import into your campaigns.
inurl footprints pertain to common sections of URLs that could be useful for our content. For example, if we wanted to get all vBulletin forums we could use /forumdisplay.php? as our footprint. (Make sure you include the leading slash!) Since many CMSs share common strings in their URLs, you should be able to build quite a few footprints just by studying their configurations across a few sites. This will help you “clean” the footprints by knowing what to expect across many servers and deployments.
One last important point: when using inurl footprints with ScrapeBox, make sure Google is the only engine selected in the “Engines & Proxies” section.
inurl footprints are incredibly powerful at targeting CMS-based sites. Study as many different CMSs as you can to get a feel for their URL structures across a wide variety of sites.
Additive Words Footprints
Additive words footprints are elements within the page template repeated across the CMS. A great example would be “Proudly powered by WordPress”. Because many templates display this in their footer, we can use this text to source thousands of sites.
With many websites these days opting to use a pre-built commercial template, many template developers include something like “Developed by Sun Studios”. Looking for templates on sites like Theme Forest and adding their credit lines to our footprint list could be a major boon and give us some footprints others have not considered!
Don’t be afraid to get unconventional. Many sites use the same login forms in a widget position making for an easy harvest.
Harvesting keywords is a breeze with ScrapeBox. In the main ScrapeBox window click the “Scrape” dropdown button and select “Keyword Scraper”.
Your next window will have two text boxes. Type your main keywords in the left-hand box, with each new line starting a new term. Once your keywords are in, click “Scrape”. After a few moments you should be presented with a list of related keywords in the right-hand box. Click the “Remove” dropdown and select “Remove Duplicates”. Once this cleans our list we will cut and paste it into the left-hand box, then repeat the process. This will supply us with a large swath of related keywords. Click the “Send to ScrapeBox” button and we have captured all the keywords we need.
Now in the main ScrapeBox window we should have our proxy list and keyword list populated. We are almost ready to begin our URL harvesting.
Merging Footprints and Keywords
Now that we have our keyword list we want to merge it with our footprints list from earlier. This allows us to take advantage of advanced Google searches for a higher relevancy yield on our URL harvest. Now you’re really getting a handle on how to use ScrapeBox!
Merging is easy. Click the “M” button in the harvester pane and a browse window will pop up. Select your footprints file from earlier and the keyword list will be modified automatically! If you have yet to create a footprints file, hop over to the Matthew Woodward blog and sign up to download his starter ScrapeBox assets.
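If it helps to picture what the merge produces, the sketch below pairs every footprint with every keyword to build one search query per combination. This is an assumption about the general shape of the result, not a description of ScrapeBox’s internals. The function name `merge_footprints`, the sample keywords, and the use of Google’s `inurl:` operator as a prefix are all illustrative choices.

```python
from itertools import product


def merge_footprints(keywords, footprints):
    """Combine every footprint with every keyword into one search query each."""
    keywords = [k.strip() for k in keywords if k.strip()]
    footprints = [f.strip() for f in footprints if f.strip()]
    return [f"{footprint} {keyword}" for footprint, keyword in product(footprints, keywords)]


# Footprints taken from the examples in this article; keywords are made up.
footprints = ["inurl:/forumdisplay.php?", '"Proudly powered by WordPress"']
keywords = ["dog training", "puppy care"]

for query in merge_footprints(keywords, footprints):
    print(query)
```

Two footprints and two keywords give four queries, so the merged list grows quickly. That is part of why the harvest that follows can take so long.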
Start Harvesting URLs
Now all you have to do is click the “Start Harvesting” button under the “URLs Harvested” pane. This is a very long process, especially with public proxies. Run it overnight.
Once the URL harvest has gathered enough results, stop the harvest and remove duplicate URLs. Click the “Check Indexed” button and check the list against the Google index. After this test completes you should have a nice clean list of scraped targets.
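ScrapeBox handles duplicate removal for you, but if you export the harvested list and want to clean it elsewhere, the logic is simple. The sketch below is a rough illustration with hypothetical function names. `dedupe_urls` drops exact repeats, and `one_per_domain` is a stricter optional variant (not a step from this tutorial) that keeps only the first URL seen for each host.

```python
from urllib.parse import urlsplit


def dedupe_urls(urls):
    """Drop blank lines and exact duplicate URLs, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        cleaned = url.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


def one_per_domain(urls):
    """Keep only the first URL found for each hostname."""
    seen_hosts = set()
    kept = []
    for url in dedupe_urls(urls):
        host = urlsplit(url).hostname  # lowercased host, or None if no scheme
        if host is None or host in seen_hosts:
            continue
        seen_hosts.add(host)
        kept.append(url)
    return kept


# Hypothetical harvested URLs for illustration.
harvested = [
    "http://example.com/a",
    "http://example.com/a",
    "http://example.com/b",
    "https://forum.example.org/forumdisplay.php?f=2",
]
print(dedupe_urls(harvested))
print(one_per_domain(harvested))
```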
As with anything, this is just a start. Check out the video below from SEOSunite. It’s an excellent source and will let you watch the entire process over his shoulder!
Now that you understand the basics of ScrapeBox, it is important to beef up your footprints file as best you can. Since we use this file for all our projects, building it is in many ways the most important step of all we have covered here. With a strong footprints file and good private proxies you should be well on your way to building tiered links.