You Are What You Decide: A Journey in Automation of Our Selves. |
MongoDB collections consists of binary JSON objects, the reading of which in Python is well covered here. However, I did not find a starightforward way to read the JSON objects into DataFrames, so here is one way I had found to complete the task.
Notice, that Pandas Series object behaves very similar to a Python Dict. However, the index of Series, unlike the keys of a dictionary, preserve the order. This implies that we can go through MongoDB, iterate over every JSON, and record the values by sequentially creating hierarchical Series index of tuples.
The idea is that, if, for instance, if we have dictionary-list object like this:
from pandas import DataFrame {1 : [{'keyword1': { 'conversions1': 2, 'cost1': 1}}], 2 : [{'keyword1': { 'value': 3 }, 'keyword2': { 'nan': 4 }}], 3 : [{'keyword2': {'value': 5}}] }
then we can write a loop that converts it into a "dictionary" (albeit with complex ordered keys) like this:
d = {(1, 'keyword1', 'conversions1'): 2, (1, 'keyword1', 'cost1'): 1, (2, 'keyword1', 'value'): 3, (2, 'keyword2', nan): 4, (3, 'keyword2', 'value'): 5}
and then use unstack()
method to create a DataFrame. In reality we don't have a dictionary with controllable order of keys, so we just define list of tuples, and a list of values:
tuples = [(1, 'keyword1', 'conversions1'), (1, 'keyword1', 'cost1'), (2, 'keyword1', 'value'), (2, 'keyword2', nan), (3, 'keyword2', 'value')] values = [1, 2, 3, 4, 5] names = ['dates', 'keywords', 'attributes'] s = DataFrame(values, index=pd.MultiIndex.from_tuples(tuples, names=names))[0]
Then .unstack
the desired slices:
df = s.unstack(['keywords','attributes']) df1 = s.ix[:,'keyword1'].unstack('attributes') df2 = s.ix[:,:,'value'].unstack('keywords')
By the way, from Pandas DataFrame then you can then easily get R data.frame:
import pandas.rpy.common as com r_df = com.convert_to_r_dataframe(df)
Here is the full example code.
Here is a real example from my application:
from pymongo import MongoClient from bson.objectid import ObjectId import pandas as pd # Connecting to database c = MongoClient() # Querying the database query = {} #"account_id" : "XXX-YYY-ZZZZ"} cursor = c.adwords['keywords'].find(query) #.limit(3) source = list(cursor) tuples = [] values = [] names = ['Date', 'AccountId', 'CampaignId', 'AdGroupId', 'KeywordId', 'Attribute'] # To see all attributes: source[0]['keyword_list'][0].keys() for o, observation in enumerate(source): # source[:2] for k, keyword in enumerate(observation['keyword_list']): for a, attr in enumerate(keyword): if attr not in names: tuples.append((observation['_id'].generation_time,\ observation['account_id'],\ keyword['Campaign'], \ keyword['AdGroup'], \ keyword['KeywordId'], \ attr,)) values.append(keyword[attr]) s = pd.DataFrame(values, index=pd.MultiIndex.from_tuples(tuples, names=names))[0]
test
me