A lot of people, myself included, waste time getting MongoDB data into NumPy or pandas data structures.
You could do it with pymongo. The general process: initialize the pymongo driver, make a query, wait for pymongo to convert everything into lists of SON (BSON) objects (i.e., dictionaries), parse the data out of those dicts, and copy it into a NumPy array. But that work has already been done for you, so why do it again?
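To make the manual route concrete, here is a rough sketch of it; the documents and field names are hypothetical stand-ins for what a real `collection.find()` call would return:

```python
import numpy as np

# hypothetical documents, standing in for pymongo's collection.find() results;
# in real code this would be: docs = db.collection.find(query)
docs = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]

# the manual route: pull each field out of every dict, then copy into arrays
x = np.array([d["x"] for d in docs])
y = np.array([d["y"] for d in docs])
```

Every document gets turned into a Python dict first, which is exactly the overhead Monary skips.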
Thanks to djcbeach, we have a nifty little module that uses mongo's C driver, the BSON C library, and Python's ctypes module to load data directly into NumPy arrays. It's fast and easy! From there, we can pass the result into Wes McKinney's pandas DataFrame and be very, very happy.
Let's look into this, shall we?
Assuming you already have NumPy and a MongoDB server, install Monary. Don't use pip, because the module isn't even in the cheeseshop.
Make a DB connection
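A minimal sketch of this step; it assumes the monary package is installed and a mongod is running, and the host address here is hypothetical:

```python
HOST = "127.0.0.1"  # hypothetical address of the running mongod

def connect(host=HOST):
    # requires the monary package and a live mongod at `host`
    from monary import Monary
    return Monary(host)
```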
Make a query and receive NumPy arrays
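The query step looks roughly like this; the database name, collection name, field names, and type strings below are all hypothetical:

```python
# hypothetical field names and Monary type strings; one entry per column
fields = ["x", "y"]
types = ["float64", "float64"]

def query_numpy(host="127.0.0.1"):
    # requires monary installed and a mongod with a "db.collection" namespace
    from monary import Monary
    m = Monary(host)
    # an empty query ({}) selects every document;
    # the result is one NumPy masked array per requested field
    arrays = m.query("db", "collection", {}, fields, types)
    m.close()
    return arrays
```

The masked arrays let Monary mark documents that were missing a field, instead of silently inserting garbage values.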
For ultimate happiness, pass this into a pandas DataFrame (assuming you’ve also installed pandas)
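The hand-off to pandas is then a one-liner per column; the arrays and field names below are hypothetical stand-ins for Monary's query output:

```python
import numpy as np
import pandas as pd

# hypothetical column arrays, standing in for Monary's query() results
arrays = [
    np.ma.masked_array(np.arange(5, dtype="float64"), mask=[0, 0, 1, 0, 0]),
    np.ma.masked_array(np.ones(5), mask=False),
]
fields = ["price", "qty"]

# filling masked entries with NaN matches pandas' missing-data model
df = pd.DataFrame({f: np.ma.filled(a, np.nan) for f, a in zip(fields, arrays)})
```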
I don’t think I could do a benchmark comparison against pymongo without feeling stupid about it, but if you’re interested in seeing where this process spends most of its time, check this out:
I put the above code into a function called get_tu, which populates 5 columns, each with 1,200,000 rows of non-NaN data, and most of the ~2 seconds it took was spent waiting on mongo. (FYI: I’m using mongo 2.1.3-pre, which returns data for this query ~0.5 seconds faster than the current stable version of mongo.)