Datasets these days are getting larger, and so is hardware, in keeping with Moore's law. However, Moore's law describes a population-level trend. For a given user, budget constraints mean that hardware resources increase as a step function, while the growth in dataset sizes can be approximated by a much smoother exponential. This keeps most users perpetually chasing their own hyper-personal version of big data.
I recently started using Lua and Torch to learn about neural networks. I wanted to start simple and build a classifier for a Kaggle competition. However, the data for this competition is too big for my machine and its measly 16 gigs of RAM. [This used to be a luxury only half a decade ago.]
So, after some digging around on GitHub, I figured out how to train with fairly large datasets using stock Torch infrastructure, and I'm gonna show you how and why it works.
How did you do that?
Hold up, what all did you try?
csv2tensor
The file that I have is 2 gigs on disk. So my first move, obviously, was to throw caution to the wind and attempt to load the dataset anyway. To this end, I tried csv2tensor. This was a no-go right from the get-go (you see what I did there?).
This library converts the columns in the data into Lua tables and then converts those tables to torch `Tensor`s. The rub is that tables in Lua are not allocated in user memory but in the runtime's own heap, which has a small upper bound in LuaJIT, the runtime Torch is typically compiled against. This means that using this library will blow up for any dataset which expands to more than a gig or so.
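For the record, the failed attempt was essentially a one-liner. If I remember the project's README correctly, `csv2tensor.load` is the whole API; treat this as a sketch, with the path purely illustrative:

```lua
require 'torch'
local csv2tensor = require 'csv2tensor'

-- every column is staged as a plain Lua table before tensor
-- conversion, so the allocations land in LuaJIT's limited heap
local data = csv2tensor.load('train.csv')  -- blows up on a ~2 GB file
```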
csvigo
csvigo is the standard library for working with csv files in Lua / Torch7. Again, my first attempt was to read the whole dataset into memory with this package. I opened `top` in a split and proceeded to load the dataset. While this library did not run into the issue above (I think because it uses `Tensor`s directly, which are allocated in user space), it quickly ate up all the available memory and a few gigs of swap before I killed the process. I also tried `setdefaulttensortype` with `torch.FloatTensor`; I killed that run after less than a minute, by which point it had already clocked 7 gigs.
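For completeness, the doomed attempt looked roughly like this (again, `train.csv` stands in for the actual competition file):

```lua
require 'torch'
local csvigo = require 'csvigo'

-- floats instead of doubles halve the per-element cost,
-- but the run still raced past 7 gigs before I killed it
torch.setdefaulttensortype('torch.FloatTensor')

-- the default mode parses the entire file into memory up front
local data = csvigo.load{path = 'train.csv'}
```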
The solution: csvigo with mode=large
At this point, I looked at the csvigo documentation again and found the large-mode option. I decided to try it and was able to successfully put together this solution:
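What follows is a minimal sketch of the essential loading-and-indexing part; the full script also defined and trained the classifier, and the `train.csv`-with-label-in-the-first-column layout is an assumption you should adapt to your data:

```lua
require 'torch'
local csvigo = require 'csvigo'

-- mode = 'large' keeps the file as raw bytes and materializes
-- rows lazily, one at a time, as they are indexed
local loaded = csvigo.load{path = 'train.csv', mode = 'large'}

-- # dispatches through the __len metamethod, which needs the
-- 5.2-compat LuaJIT that Torch ships with
local n = #loaded - 1                   -- row 1 is the header

-- turn the i-th data row into a feature tensor and a label
-- (label-in-first-column is an assumption about the csv layout)
local function getSample(i)
   local row = loaded[i + 1]            -- +1 skips the header
   local label = tonumber(row[1])
   local feats = torch.Tensor(#row - 1)
   for j = 2, #row do
      feats[j - 1] = tonumber(row[j])
   end
   return feats, label
end

-- feed these to your favorite training loop, one sample at a time
local x, y = getSample(1)
```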
With this method I was able to use the dataset with only two gigs of memory allocated to the process. Perhaps in another post I will benchmark the performance penalty of doing this when you do have sufficient RAM. I have also not figured out how this needs to be adapted for GPUs. But it definitely beats the alternative of not using a moderately large dataset at all.
What’s the magic sauce? o.O
Note how the documentation modestly states that the efficiency is all under the hood? Which is why we need to use the source, Luke.
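The (incomplete) snippet below is my paraphrase of csvigo's large-mode loader, stitched together around the identifiers we are about to comprehend; the real code, split between csvigo and its C helper libcsvigo, differs in the details, and `fromcsv`'s exact signature is my guess:

```lua
-- paraphrase of csvigo's large mode, not the verbatim source
local function load_large(path)
   -- 1. read the whole file as raw bytes into a ByteStorage
   local f = torch.DiskFile(path, 'r'):binary()
   f:seekEnd()
   local length = f:position() - 1     -- file size in bytes
   f:seek(1)
   local data = f:readByte(length)     -- a torch.ByteStorage
   f:close()

   -- 2. one scan over the bytes, recording an {offset, length}
   --    pair for every record
   local lookup = libcsvigo.create_lookup(data)

   -- 3. materialize row i only on access: slice the bytes back
   --    into a string, then split it on the separator
   local function index(i)
      local offset, len = lookup[i][1], lookup[i][2]
      local chars = {}
      for j = 1, len do                -- the real code is faster
         chars[j] = string.char(data[offset + j])
      end
      return fromcsv(table.concat(chars), ',')
   end
   -- (the returned object routes reads through index)
end
```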
- Loading to `ByteStorage`: The first interesting thing the code does is load the file as raw bytes. This is the part of the code spanning `torch.DiskFile(path, 'r'):binary()` to `f:close()`; the intermediate steps only calculate the length of the content in bytes. This makes the data take only as much space in memory as it does on disk (remember how my 2 GB file took only 2 gigs of RAM?).
- Calculating the byte offset and the byte length of each record in the file: `libcsvigo.create_lookup(data)` essentially scans the byte representation of the file and records at what offset from the beginning the first byte of each record sits. Additionally, it stores how many bytes are in each record, i.e. the byte length. These pairs are stored in the `lookup` table (see the toy version after this list).
- Accessing the `i`th record in the file: When it's time to access the `i`th row, the `index` function creates a string by reading `lookup[i][2]` bytes starting from offset `lookup[i][1]`. The `fromcsv` function then splits this record string into a `table` on the separator and returns it to us.
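To make the lookup concrete, here is a toy, pure-Lua rendition of what `create_lookup` conceptually computes; the real C implementation's exact conventions (0- vs 1-based offsets, newline handling) may differ:

```lua
-- toy version: one pass over the bytes, emitting an
-- {offset, length} pair for every newline-terminated record
local function toy_lookup(s)
   local lookup, offset = {}, 1
   for record in s:gmatch('([^\n]*)\n?') do
      if offset > #s then break end
      table.insert(lookup, {offset, #record})
      offset = offset + #record + 1    -- +1 for the newline
   end
   return lookup
end

local lk = toy_lookup('id,y\n1,0\n2,1\n')
-- lk = { {1, 4}, {6, 3}, {10, 3} }   (1-based offsets here)
```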
This allows us to load large datasets while employing only as much memory as the size of the file on disk (in bytes), plus the additional memory cost of the `LongTensor` lookup table. There is, of course, the additional CPU cost of parsing each record every time it is read.