Monary+Mongo+Pandas = :)

A lot of people, myself included, waste time getting mongo data into numpy or pandas data structures.

You could do it using pymongo. The general process would be to initialize the pymongo driver and make a query, wait for pymongo to convert the results into lists of SON (BSON) objects (aka dictionaries), parse the data into arrays, and then copy it into some numpy array. But that work has already been done for you, so why do it again?
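To make the manual route concrete, here is a minimal sketch of the "parse the dictionaries and copy them into numpy arrays" step. It does not talk to mongo at all: `docs` is a made-up stand-in for what a pymongo cursor would hand back, and `docs_to_arrays` is a hypothetical helper name, not part of pymongo or Monary.

```python
import numpy as np

# Stand-in for the list of dicts a pymongo cursor would yield
# (hypothetical field names and values, not real data).
docs = [
    {"field1": 1, "field2": 2.5},
    {"field1": 2, "field2": 3.5},
    {"field1": 3},  # field2 is missing in this document
]


def docs_to_arrays(docs, fields, dtypes):
    """Copy each requested field out of the dicts into its own numpy array."""
    arrays = {}
    for field, dtype in zip(fields, dtypes):
        # Start fully masked, so missing values stay flagged instead of guessed.
        col = np.ma.masked_all(len(docs), dtype=dtype)
        for i, doc in enumerate(docs):
            if field in doc:
                col[i] = doc[field]  # assigning a value unmasks that slot
        arrays[field] = col
    return arrays


arrays = docs_to_arrays(docs, ["field1", "field2"], ["int32", "float64"])
print(arrays["field1"])
print(arrays["field2"])
```

The inner loop runs in Python once per document per field, which is the part you would rather not pay for on a million rows.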

Thanks to djcbeach, we have a nifty little module that utilizes mongo’s C driver, the bson C library and python’s ctypes module to load data directly into numpy arrays. It’s fast and easy! From there, we can pass the arrays into Wes McKinney’s pandas DataFrame and be very, very happy.

Let’s look into this, shall we?

  1. Assuming you already have numpy and a mongo server, install Monary. Don’t use pip, because the module isn’t even in the cheeseshop.
        $ hg clone ssh:// ./monary
        $ cd ./monary && python setup.py install
  1. Make a db connection
        $ python
        >>> from monary import Monary
        >>> mon = Monary() # connection to localhost
  1. Make a query and receive numpy arrays
        >>> columns = ['field1', 'field2', 'field3']
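        >>> # 'mycollection' is a placeholder collection name; {} matches every document
        >>> numpy_arrays = mon.query('mydb', 'mycollection', {}, columns,
        ...                          ['int32', 'date', 'string:20'])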
  1. For ultimate happiness, pass this into a pandas DataFrame (assuming you’ve also installed pandas)
        >>> import numpy
        >>> import pandas
        >>> df = numpy.matrix(numpy_arrays).transpose()
        >>> df = pandas.DataFrame(df, columns=columns)
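One thing to watch with the matrix route above: a numpy matrix has a single dtype, so columns of different types get pushed into one shared type. A possible alternative, sketched here under assumptions, is to hand pandas a dict of column name to array so each column keeps its own dtype. The arrays below are made-up stand-ins for query results, so the sketch runs without a mongo server.

```python
import numpy as np
import pandas as pd

columns = ["field1", "field2", "field3"]

# Made-up stand-ins for the per-column arrays a query would return.
numpy_arrays = [
    np.array([1, 2, 3], dtype="int32"),
    np.array(["2012-01-01", "2012-01-02", "2012-01-03"], dtype="datetime64[ns]"),
    np.array(["a", "b", "c"]),
]

# Pair each column name with its array; pandas keeps one dtype per column.
df = pd.DataFrame(dict(zip(columns, numpy_arrays)), columns=columns)
print(df.dtypes)
```

Passing `columns=columns` just pins the column order to the order of the field list.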

I don’t think I could safely do a benchmark comparison to pymongo and not feel stupid about it, but if you are interested in seeing where this process spends most of its time, check this out:

I put the above code into a function called get_tu, which populates 5 columns, each with 1,200,000 rows of non-NaN data. Most of the ~2 seconds it took was spent waiting on mongo. (FYI: I’m using mongo 2.1.3-pre, which returns data for this query about 0.5 seconds faster than the current stable version of mongo.)

        In [53]: %prun -s cumulative main.get_tu()

           342 function calls in 2.221 seconds

           Ordered by: cumulative time

           ncalls  tottime  percall  cumtime  percall filename:lineno(function)
                1    0.013    0.013    2.221    2.221 <string>:1(<module>)
                1    0.000    0.000    2.208    2.208
                1    1.569    1.569    1.588    1.588
               14    0.616    0.044    0.616    0.044 {numpy.core.multiarray.array}
                1    0.000    0.000    0.616    0.616
                1    0.000    0.000    0.018    0.018
                5    0.014    0.003    0.014    0.003 {numpy.core.multiarray.zeros}
                1    0.000    0.000    0.004    0.004
                5    0.000    0.000    0.004    0.001
                5    0.004    0.001    0.004    0.001 {method 'fill' of 'numpy.ndarray' objects}
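
If you are not in IPython, roughly the same kind of report can be produced with the standard library’s cProfile and pstats modules. This is only a sketch: the `get_tu` below is a dummy stand-in that burns a little CPU, not the real function from this post, so swap in your own query code.

```python
import cProfile
import io
import pstats


def get_tu():
    # Hypothetical stand-in for the real function; replace with your query code.
    return sum(i * i for i in range(100000))


# Collect timing data only while get_tu() runs.
profiler = cProfile.Profile()
profiler.enable()
get_tu()
profiler.disable()

# Sort by cumulative time, like `%prun -s cumulative`, and keep the top 10 rows.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```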