Coffee Space


Archiving

Why?

Importance

The first thing that got me thinking about the importance of archiving data was the famous case of the missing Dr Who episodes: the BBC routinely deleted archived programmes and decided it wasn't worth keeping old Dr Who episodes anymore [1]. 97 of the 253 episodes from the programme's first six years are gone.

The reason these things disappeared was not a lack of popularity, a lack of archiving capability (VHS) or a lack of desire to archive. I believe the real reason they were lost is that nobody (at least nobody of significance) ever took it upon themselves to archive them. There was possibly some assumption that "it's okay, somebody else will surely have done it", when that simply wasn't true.

For me, there are things I really don't want to lose, such as certain YouTube channels, films, books and various other media. There are some things that I have come across in this life that would simply be a shame to lose, things that I think future generations of the human race ought to have available to them.

Expectations

People have certain expectations of archived material, listed below:

  • Material should be easily accessible
  • Material should be in good quality
  • Material should be archived in the various forms it was produced in

Of course, not all of this is possible, as will be discussed in the "Issues" section.

Issues

The following are the predicted issues with the project:

  • Archiving Size - How much can feasibly be archived?
  • Copyright - A lot of the material may be copyrighted, meaning that it will be a number of years before the restriction is lifted.
  • Future Proof - Given the length of time we will likely be storing this data for, we have to consider that the technology used to archive today may be obsolete in the future.
  • Cost - This is clearly a long term investment, something that an investor would be unlikely to want to invest in as the payoff is far in the future. Funding will certainly be a problem, as we will discuss later.
  • Searchable - The content must be easily searchable - if you can't find what you're looking for then it defeats the object of the archive.

With these issues taken into consideration, we will attempt to tackle them.

What?

Firstly, a restriction will be made: only content that is not easily reproducible will be stored. A billion digits of Pi, for example, would not be considered a good use of archiving resources, whereas the works of Shakespeare would be considered worthy of archiving.

Text

Text is by far the easiest format to compress: human written text compresses very well due to its built-in redundancy. We should try to encapsulate as much information in this form as possible to increase our ability to compress data.

The information that should be included in this category is:
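
As a rough illustration of that redundancy, the sketch below compares how well zlib compresses prose against random bytes of the same length. The sample sentence is a made-up stand-in for archived text, and repeating it exaggerates the effect compared with real prose; the choice of zlib is also an assumption, not a decision made in this article.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)


# Hypothetical sample: a repeated English sentence stands in for archived text.
prose = b"The quick brown fox jumps over the lazy dog. " * 200
# Random bytes carry no redundancy, so they act as a worst-case comparison.
noise = os.urandom(len(prose))

print(f"prose: {compression_ratio(prose):.3f}")
print(f"noise: {compression_ratio(noise):.3f}")
```

The prose ratio comes out far below the noise ratio, which stays close to 1 because there is nothing for the compressor to exploit.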

  • txt files. These can be upgraded to include further formatting that helps the text be more readable.
  • pdf files. Where the PDFs have images, they shall be extracted and stored in the image database.
  • htm and html files. These will be extracted as best as possible, but clearly presentation will be lost.

A proposed format for this data is markdown (md), as discussed in a previous article - an appropriate format for storing data in a near-pure form. Some documents can be stored as Unicode so as to include special characters, which may be paramount to the data. Where possible, only 7-bit ASCII will be stored.
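
The "ASCII where possible, Unicode otherwise" rule could be sketched as follows. Using UTF-8 as the Unicode encoding is an assumption on my part, and the function names are hypothetical.

```python
def storage_encoding(text: str) -> str:
    """Pick 'ascii' when every character fits in 7 bits, otherwise 'utf-8'."""
    return "ascii" if text.isascii() else "utf-8"


def encode_for_archive(text: str) -> bytes:
    """Encode a document using the narrowest encoding that can represent it."""
    return text.encode(storage_encoding(text))


print(storage_encoding("To be, or not to be"))  # ascii
print(storage_encoding("café"))                 # utf-8
```

Since ASCII is a subset of UTF-8, a reader that always decodes as UTF-8 would handle both cases, so the distinction mostly matters for bookkeeping.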

Images

Images will come in many formats and will be stored in various versions of jpg, at a quality that allows the original meaning to be known. A 4k image of a circle is both easily reproducible and pointless to store at a large resolution.

Video

Video is an interesting problem, as it is naturally very large. It needs to be stored in a format that is both future proof and compresses well. It may be worth limiting the video resolution, video colour depth and video sound quality in a controlled manner. It's bad enough that the quality must be dropped, so it makes sense to at least do this in a controlled manner.

How?

Resources

Internet Connection

For now, with no real need to serve data, a 10Mb connection would suffice. This would easily serve 100 connections at a time given that each connection is restricted to approximately 100kb.

This connection would likely cost £10 per month, or £1200 over 10 years. This is a sizeable cost, but may be reduced if the internet connection also served other purposes. This of course also assumes that the cost of a worldwide connection does not reduce over time.
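
The bandwidth and cost figures above can be checked with a few lines. This assumes "Mb" and "kb" mean megabits and kilobits per second with 1 Mb = 1000 kb; the constant names are only for illustration.

```python
LINK_KBIT_PER_S = 10_000   # 10Mb connection, expressed in kilobits per second
CONNECTIONS = 100          # simultaneous connections to serve
MONTHLY_COST_GBP = 10      # estimated price of the connection
YEARS = 10                 # planning horizon used in this article

per_connection_kbit = LINK_KBIT_PER_S / CONNECTIONS
total_cost_gbp = MONTHLY_COST_GBP * 12 * YEARS

print(per_connection_kbit)  # 100.0
print(total_cost_gbp)       # 1200
```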

Electricity

Each hard drive would need approximately 10W [2]; assuming 10 hard drives, we would need 100W for the hard drives alone. The device to power these hard drives consumes 100W of power [3].

As for the computer, assuming the worst power consumption for any variant of the Raspberry Pi, we could consume as much as 5W [4].

Each hard drive will need a data connector to convert the system to USB [5], with power consumption being roughly equal to that of the hard drive already calculated.
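
To get a feel for what these figures mean over the life of the project, here is a rough sketch that converts a constant power draw into energy. It assumes everything runs continuously for 10 years, which is an upper bound, and it only counts the drives and the Raspberry Pi, since the overhead of the adapters and hub is not pinned down precisely above.

```python
HOURS_PER_YEAR = 24 * 365


def energy_kwh(watts: float, years: float) -> float:
    """Energy used by a constant load, in kilowatt hours."""
    return watts * HOURS_PER_YEAR * years / 1000


drives_w = 10 * 10        # 10 drives at roughly 10W each
pi_w = 5                  # worst case for a Raspberry Pi
base_w = drives_w + pi_w  # adapters and hub overhead deliberately excluded

print(base_w)                  # 105
print(energy_kwh(base_w, 10))  # 9198.0
```

Multiplying the result by a local electricity price would give a cost estimate; no price is assumed here.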

Maintenance

It is expected that initially, this could cost 200 man hours to prepare and later 10 hours per month to maintain. Over 10 years this could cost 1400 man hours.

Hardware

Computer

For computation, it has been decided that the Raspberry Pi would be a good fit as it offers many features, including but not limited to:

  • 100Mb ethernet connection
  • 1 or more low power cores
  • Low power consumption
  • Easily hackable
  • Low form factor

The cost of this device is approximately £30, making it a very affordable option. A USB hub with 10 ports will also need to be purchased for data communication.

Storage

SATA drives have been chosen for this task as the current cheapest form of storage. Currently, these are also well supported and available. It is anticipated that this may not be the case in the future.

10 drives with 4TB each would in theory give you 40TB. Assuming threefold redundancy, you may be looking at just over 10TB of safe storage. Another decision would be to implement a safer file system, such as ZFS (Oracle), which is supposedly more difficult to corrupt.
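
The capacity estimate can be sketched as below, reading the redundancy as every file being kept in three copies. That reading is an assumption; a different redundancy scheme would give a different usable figure.

```python
def usable_tb(drives: int, tb_per_drive: float, copies: int) -> float:
    """Usable capacity when every file is stored `copies` times."""
    return drives * tb_per_drive / copies


print(round(usable_tb(10, 4, 3), 1))  # 13.3
```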

Cooling

From past experience, hard drives have an operating temperature of about 50 degrees. Anything hotter than this may have a bad effect on the performance and life span of the disks. It's therefore important to make sure there is enough space for cooling. This requires some kind of rack.

Storage Rack

For the purpose of cost, it makes sense to build a custom storage rack for the hard drives, computer and various power supplies. The drives can be loaded sideways, meaning the drives aren't stacked. This should significantly reduce the heat building up between drives.

The cost should not exceed £50.

Software

Web Server

A simple web server, such as nginx, could be used to serve the data. This can be configured to run PHP and connect to an sqlite database. Some roles of the web server and database combination would be:

  • Storing alternative storage locations for data.
  • Checksums for data to detect corruption.
  • Status of hard drives for loading purposes.
  • Backing up files once a fresh hard drive is loaded. Drive integrity should be checked at this point.
  • Offering some way to search data. This can be made easier by only starting a search if a certain criterion is met, e.g. a minimum number of characters.
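
Two of the roles above, checksums and a length-gated search, could look roughly like the sketch below. It is written in Python rather than PHP purely for illustration; the table layout, the function names and the three character threshold are all hypothetical.

```python
import hashlib
import sqlite3

MIN_QUERY_LENGTH = 3  # hypothetical threshold before a search is attempted


def sha256_of(data: bytes) -> str:
    """Hex digest used to detect corruption of a stored file."""
    return hashlib.sha256(data).hexdigest()


def open_index() -> sqlite3.Connection:
    """Create an in-memory index; a real deployment would use a file on disk."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (name TEXT, drive TEXT, sha256 TEXT)")
    return db


def add_file(db: sqlite3.Connection, name: str, drive: str, data: bytes) -> None:
    """Record where a file lives and what its contents should hash to."""
    db.execute("INSERT INTO files VALUES (?, ?, ?)", (name, drive, sha256_of(data)))


def is_intact(db: sqlite3.Connection, name: str, data: bytes) -> bool:
    """Compare the current contents against the checksum recorded at archive time."""
    row = db.execute("SELECT sha256 FROM files WHERE name = ?", (name,)).fetchone()
    return row is not None and row[0] == sha256_of(data)


def search(db: sqlite3.Connection, query: str) -> list[str]:
    """Return matching file names, refusing queries that are too short."""
    query = query.strip()
    if len(query) < MIN_QUERY_LENGTH:
        return []
    rows = db.execute(
        "SELECT name FROM files WHERE name LIKE ? ORDER BY name", (f"%{query}%",)
    )
    return [row[0] for row in rows]


db = open_index()
add_file(db, "hamlet.md", "drive01", b"To be, or not to be")
print(is_intact(db, "hamlet.md", b"To be, or not to be"))  # True
print(search(db, "ha"))   # []
print(search(db, "ham"))  # ['hamlet.md']
```

Rejecting short queries up front avoids scanning the whole index for patterns that would match almost everything.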

File Storage

For file storage, the way to get the energy consumption down is to reduce the number of devices that are powered on. This could be automated using umount/mount, some relays and some settings in the database.
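
A minimal sketch of that automation is shown below. It only plans the umount and mount commands needed to move from the currently mounted drives to the ones a request needs; it does not run them. The mount root, the use of /dev/disk/by-label and the drive labels are assumptions, and relay switching is left as a comment because it depends entirely on the hardware chosen.

```python
def plan_commands(
    needed: set[str], mounted: set[str], mount_root: str = "/mnt/archive"
) -> list[list[str]]:
    """Return the commands that bring the mounted set in line with the needed set."""
    commands: list[list[str]] = []
    # Unmount drives that are no longer required; a relay could cut power afterwards.
    for drive in sorted(mounted - needed):
        commands.append(["umount", f"{mount_root}/{drive}"])
    # Mount newly required drives; a relay would have to power them on first.
    for drive in sorted(needed - mounted):
        commands.append(
            ["mount", f"/dev/disk/by-label/{drive}", f"{mount_root}/{drive}"]
        )
    return commands


for command in plan_commands(needed={"drive02"}, mounted={"drive01"}):
    print(" ".join(command))
```

Each planned command could then be handed to subprocess.run once the relay for that drive has been switched, with the database recording which drives are currently on.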

Conclusion

For 10TB for 10 years, it looks like one could easily sink £5k into the project and a lot of time. Whilst a good cause, at the time of writing it's outside of my budget. Given one or two more years, it's likely that a lot of this technology may be more advanced and cheaper.

References

[1] https://en.wikipedia.org/wiki/Doctor_Who_missing_episodes

[2] http://www.tomshardware.co.uk/forum/267776-32-hard-drive-power-consumption

[3] https://www.amazon.co.uk/Compact-12-Port-Charger-ORICO-DUB-12P/dp/B01IVF3WK0/ref=sr_1_5?ie=UTF8&qid=1483826650&sr=8-5&keywords=100W+usb+hub

[4] https://en.wikipedia.org/wiki/Raspberry_Pi

[5] https://www.amazon.co.uk/Serial-Adapter-Cable-Laptop-Drive/dp/B0182C8Z40/ref=pd_vtph_107_lp_t_3?_encoding=UTF8&psc=1&refRID=HJYX2VNAWAG0AFKEQ8YS