Coffee Space

Scraper

What?

I decided to write a quick dataset creation tool, the idea being to train a neural network on a wide range of data. Essentially, you give it the number of images to source and several queries of interest, then let it do its thing. I originally wanted to use DuckDuckGo, but that search engine requires JavaScript to view images, so nope. I used Yahoo images many years ago for a similar project, and yet again it proves valuable for scraping.

The code itself is very simple and inefficient; it’s literally just a dirty hack to scrape for images. One caveat is that urllib doesn’t seem to want to play with HTTPS, and a number of web servers simply block plain HTTP requests, so some downloads fail. Additionally, a number of images arrive corrupted somehow and my image viewer refuses to display them. It’s probably urllib doing something strange, because they all work fine in several browsers before downloading.

In the future it should:

Check that a valid image was in fact downloaded
Find more images to meet the requirements
Perform some proper checks to make sure the image hasn’t already been downloaded (repeats can appear in searches)
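As a rough idea of how the first and last items on that list might look, here is a minimal sketch. The helper names `looks_like_image` and `is_new_image` are made up for illustration and are not part of the script. It assumes that checking the leading bytes of a file is a good enough validity test (it catches things like an HTML error page saved as a .jpg, but not a truncated image), and that repeats are byte-for-byte identical.

```python
import hashlib

# Leading bytes ("magic numbers") of the formats the scraper saves
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_like_image(data):
    """Return True if data starts with a JPEG or PNG signature."""
    return data.startswith(JPEG_MAGIC) or data.startswith(PNG_MAGIC)

def is_new_image(data, seen_hashes):
    """Return True (and remember the hash) if these exact bytes are new."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```

Both helpers work on the raw bytes of a download, so they could be applied before anything is written to disk. A resized or re-encoded copy of the same picture would still slip past the hash check.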

It’s basic, it works, don’t sue me.

Code

The following is the source; use at your own peril:

# scrape.py by B[]
#
# Created Sep 2019. Last updated Sep 2019.
#
# Search Yahoo images for a handful of images using keywords, then download
# them and apply appropriate names.
#
# Warning: Use at your own risk. If downloading lots of data, consider not
# spamming web servers, DNS servers, Yahoo search, etc.

import sys
import urllib.parse
import urllib.request

# Check that we have a count and at least one search query
if len(sys.argv) < 3 :
  print("scrape.py <NUM> <SEARCH> [SEARCH] ..")
  print("")
  print("  NUM")
  print("    Number of images to find for each search")
  print("")
  print("  SEARCH")
  print("    Search query to download images for")
  sys.exit(0)
# First argument is the number of images to find for each search
num = int(sys.argv[1])
# Loop over the remaining command line arguments (the search queries)
for query in sys.argv[2:] :
  # Encode the query for use in a URL (spaces become "+")
  u = "https://images.search.yahoo.com/search/images?p=" + urllib.parse.quote_plus(query)
  print("url -> " + u)
  f = urllib.request.urlopen(u)
  # Split the page on double quotes so each quoted string can be checked
  s = f.read().decode("utf-8", errors="replace").split("\"")
  count = 0
  skip = False
  # Loop over possible images
  for entry in s :
    # Check for image type
    img = ""
    if ".jpg" in entry :
      img = ".jpg"
    if ".jpeg" in entry :
      img = ".jpeg"
    if ".png" in entry :
      img = ".png"
    # Make sure we have a valid image (escaped slashes suggest an embedded URL)
    if img != "" and count < num and "\\/" in entry :
      # Avoid double images: skip the next match after a successful download
      if not skip :
        link = entry.replace("\\", "")
        print("Downloading -> " + link)
        # Attempt to download image and display result
        try :
          urllib.request.urlretrieve(link, query + "_" + str(count) + img)
          count += 1
          skip = True
          print("  Success")
        except Exception :
          print("  Failed")
      else :
        skip = False
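
The parsing trick in the script (split the page on double quotes, then keep strings that contain an image extension and escaped slashes) can be tried offline without touching Yahoo. The sketch below is only an illustration: `extract_image_urls` is a hypothetical helper, the `sample` string and its `full`/`thumb`/`alt` fields are invented rather than real Yahoo markup, and it drops exact repeats instead of using the script's skip-every-other-match approach.

```python
def extract_image_urls(page, limit):
    """Collect up to limit distinct image URLs from quoted strings in page."""
    urls = []
    for entry in page.split('"'):
        if len(urls) >= limit:
            break
        # Only strings with escaped slashes are treated as embedded URLs
        if "\\/" not in entry:
            continue
        if any(ext in entry for ext in (".jpg", ".jpeg", ".png")):
            # Remove the escaping backslashes to get a usable URL
            url = entry.replace("\\", "")
            # Skip exact repeats rather than relying on their order
            if url not in urls:
                urls.append(url)
    return urls

# Invented sample data; the field names are hypothetical
sample = r'{"full":"http:\/\/example.com\/a.jpg","thumb":"http:\/\/example.com\/a.jpg","alt":"cat.png"}'
print(extract_image_urls(sample, 5))
```

Here the repeated URL is kept only once, and `cat.png` is ignored because it has no escaped slashes and so is not treated as a URL.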