I decided to write a quick dataset creation tool, the idea being to train a neural network on a wide range of data. Essentially, you give it the number of images to try to source and several queries of interest, then let it do its thing. I originally wanted to use DuckDuckGo, but that search engine requires JavaScript to view images, so nope. I used Yahoo Images many years ago for a similar project, and yet again it proves valuable for scraping.
The code itself is very simple and inefficient; it's literally just a dirty hack to scrape for images. One caveat is that urllib doesn't seem to want to play nicely with HTTPS, so downloads from the many web servers that block plain HTTP requests simply fail. Additionally, some images seem to arrive corrupted somehow and my image reader refuses to display them; it's probably urllib doing something strange, because they all display fine in several browsers before downloading.
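One workaround for the corruption issue is to sanity-check the files after downloading. Here's a minimal sketch, assuming the Pillow library is installed (pip install Pillow); the directory scan is illustrative rather than part of the scraper itself:

import os
from PIL import Image

def is_valid_image(path) :
    # Pillow's verify() raises an exception on truncated or corrupt files
    try :
        with Image.open(path) as img :
            img.verify()
        return True
    except Exception :
        return False

# Check every image in the current directory and drop the broken ones
for name in os.listdir(".") :
    if name.endswith((".jpg", ".jpeg", ".png")) and not is_valid_image(name) :
        print("Removing corrupt file -> " + name)
        os.remove(name)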
In the future it should:
- Play nicely with HTTPS rather than losing images from servers that block plain HTTP (see the sketch after this list).
- Verify downloaded files and throw away anything corrupted, as sketched above.
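As a rough idea of what the HTTPS fix might look like, here's a minimal sketch that swaps urllib for the third-party requests library, which handles HTTPS without fuss; the URL and filename below are placeholders, not values from the scraper:

import requests

url = "https://example.com/image.jpg"  # placeholder URL
r = requests.get(url, timeout=10)
r.raise_for_status()  # bail out on 4xx/5xx responses
with open("image.jpg", "wb") as f :
    f.write(r.content)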
It’s basic, it works, don’t sue me.
The following is the source; use at your own peril:
# scrape.py by B[]
#
# Created Sep 2019. Last updated Sep 2019.
#
# Search Yahoo images for a handful of images using keywords, then download
# them and apply appropriate names.
#
# Warning: Use at your own risk. If downloading lots of data, consider not
# spamming web servers, DNS servers, Yahoo search, etc.
#
# Note: Written for Python 2 (urllib.urlopen and urllib.URLopener do not
# exist in Python 3).

import sys
import urllib

# Check that we have arguments to loop
if len(sys.argv) <= 1 :
    print("scrape.py <NUM> <SEARCH> [SEARCH] ..")
    print("")
    print("  NUM")
    print("    Number of images to find for each search")
    print("")
    print("  SEARCH")
    print("    Search query to download images for")
    sys.exit(0)
# Loop over the command line arguments, skipping the script name in argv[0]
num = -1
for i in range(1, len(sys.argv)) :
    # First valid argument is the number of images per search
    if num < 0 :
        num = int(sys.argv[i])
    else :
        sys.argv[i] = sys.argv[i].replace(" ", "+")
        u = "https://images.search.yahoo.com/search/images?p=" + sys.argv[i]
        print("url -> " + u)
        # Fetch the results page and split it on quotes; image URLs appear
        # as quoted strings with escaped slashes ("\/") in the page source
        f = urllib.urlopen(u)
        s = f.read().split("\"")
        count = 0
        skip = False
        # Loop over possible images
        for x in range(len(s)) :
            # Check for image type
            img = ""
            if ".jpg" in s[x] :
                img = ".jpg"
            if ".jpeg" in s[x] :
                img = ".jpeg"
            if ".png" in s[x] :
                img = ".png"
            # Make sure we have a valid image
            if img != "" and count < num and "\\/" in s[x] :
                # Avoid double images (the same URL tends to appear twice
                # in a row)
                if not skip :
                    s[x] = s[x].replace("\\", "")
                    print("Downloading -> " + s[x])
                    t = urllib.URLopener()
                    # Attempt to download image and display result
                    try :
                        t.retrieve(s[x], sys.argv[i] + "_" + str(count) + img)
                        count += 1
                        skip = True
                        print("  Success")
                    except Exception :
                        print("  Failed")
            else :
                skip = False
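For reference, a run might look like this (the queries are just examples):

python scrape.py 5 cat dog

That tries to grab five images for each of "cat" and "dog" and saves them with names like cat_0.jpg and dog_2.png.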