Create new data frames from existing data frame based on unique column values
I have a large data set (4.5 million rows, 35 columns). The columns of interest are `company_id` (string) and `company_score` (float). There are approximately 10,000 unique `company_id`s.
company_id  company_score  date_submitted  company_region
AA          .07            1/1/2017        NW
AB          .08            1/2/2017        NE
CD          .0003          1/18/2017       NW
My goal is to create approximately 10,000 new dataframes, one per unique `company_id`, each containing only that company's rows.
My first idea was to create the collection of empty data frames shown below, then loop through the original data set and append rows to each one based on its matching `company_id`.
company_dictionary = {}
for company in df['company_id']:
    company_dictionary[company] = pd.DataFrame()
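For concreteness, a minimal sketch of this loop-and-filter idea (assuming the `df` described above) might look like:

# Build one sub-DataFrame per unique company id.
# Note: this scans all 4.5M rows once per unique id (~10,000 scans),
# so it is slow at this scale compared to a single groupby pass.
company_dictionary = {}
for company in df['company_id'].unique():
    company_dictionary[company] = df[df['company_id'] == company]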
Is there a better way to do this by leveraging pandas? i.e., is there a way I can use a built-in pandas function to create new filtered dataframes with only the relevant rows?
Edit: I tried a new approach, but I'm now encountering an error message that I don't understand.
[In] unique_company_id = np.unique(df[['ID_BB_GLOBAL']].values)
[In] unique_company_id
[Out] array(['BBG000B9WMF7', 'BBG000B9XBP9', 'BBG000B9ZG58', ..., 'BBG00FWZQ3R9',
'BBG00G4XRQN5', 'BBG00H2MZS56'], dtype=object)
[In] for id in unique_company_id:
[In]     new_df = df[df['id'] == id]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C: in get_loc(self, key, method, tolerance)
   2133         try:
-> 2134             return self._engine.get_loc(key)
   2135         except KeyError:

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)()
KeyError: 'id'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-50-dce34398f1e1> in <module>()
      1 for id in unique_company_id:
----> 2     new_df = df[df['id'] == id]

C: in __getitem__(self, key)
   2057             return self._getitem_multilevel(key)
   2058         else:
-> 2059             return self._getitem_column(key)
   2060
   2061     def _getitem_column(self, key):

C: in _getitem_column(self, key)
   2064         # get column
   2065         if self.columns.is_unique:
-> 2066             return self._get_item_cache(key)
   2067
   2068         # duplicate columns & possible reduce dimensionality

C: in _get_item_cache(self, item)
   1384         res = cache.get(item)
   1385         if res is None:
-> 1386             values = self._data.get(item)
   1387             res = self._box_item_values(item, values)
   1388             cache[item] = res

C: in get(self, item, fastpath)
   3541
   3542         if not isnull(item):
-> 3543             loc = self.items.get_loc(item)
   3544         else:
   3545             indexer = np.arange(len(self.items))[isnull(self.items)]

C: in get_loc(self, key, method, tolerance)
   2134                 return self._engine.get_loc(key)
   2135             except KeyError:
-> 2136                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2137
   2138             indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)()
KeyError: 'id'
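(As Emre notes in the comments below, the KeyError arises because `df` has no column named `'id'`. A minimal fix, filtering on the column the ids actually came from, would read:)

# The mask column must exist in df; here the ids came from ID_BB_GLOBAL.
# Using `uid` instead of `id` also avoids shadowing Python's built-in id().
for uid in unique_company_id:
    new_df = df[df['ID_BB_GLOBAL'] == uid]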
Tags: python, pandas, dataframe
asked Apr 2 '18 at 18:45 by ForsakenPlague; edited Apr 3 '18 at 16:43 by Aditya
Group by `company_id`, then iterate over the results. Welcome to the site!
– Emre, Apr 2 '18 at 20:32

You try to access `df['id']`, but there is no such column. Did you mean `company_id`?
– Emre, Apr 3 '18 at 16:45
2 Answers
You can group by the `company_id` column and convert the result into a dictionary of DataFrames:
import pandas as pd

df = pd.DataFrame({
    "company_id": ["AA", "AB", "AA", "CD", "AB"],
    "company_score": [.07, .08, .06, .0003, .09],
    "company_region": ["NW", "NE", "NW", "NW", "NE"]})

# Approach 1
dict_of_companies = {k: v for k, v in df.groupby('company_id')}

# Approach 2
dict_of_companies = dict(tuple(df.groupby("company_id")))

import pprint
pprint.pprint(dict_of_companies)
Output:
{'AA': company_id company_region company_score
0 AA NW 0.07
2 AA NW 0.06,
'AB': company_id company_region company_score
1 AB NE 0.08
4 AB NE 0.09,
'CD': company_id company_region company_score
3 CD NW 0.0003}
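Looking up a single company's rows is then a plain dictionary access, for example:

# Fetch the sub-DataFrame for one company id.
aa = dict_of_companies["AA"]
print(aa["company_score"].mean())  # 0.065 for the toy data above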
answered Apr 3 '18 at 5:24 by tuomastik
Can you please explain why/how Approach 1 works? I've had a lot of difficulty with `groupby`, mainly because it returns a GroupBy object on which you have to run another aggregating function. But I tried this for a similar problem I had and it worked. Now I'm very curious how/why! Thanks
– A. K., Sep 28 '18 at 15:47
When you iterate over the `groupby` object, a tuple of length 2 is returned on each loop. The first item of the tuple corresponds to a unique `company_id` and the second item corresponds to a `DataFrame` containing the rows from the original `DataFrame` which are specific to that unique `company_id`.
– tuomastik, Sep 30 '18 at 10:45
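To make that tuple structure concrete, a quick sketch using the answer's toy `df`:

# Each iteration yields a (key, sub_frame) pair.
for company_id, sub_frame in df.groupby("company_id"):
    print(company_id, sub_frame.shape)
# AA (2, 3)
# AB (2, 3)
# CD (1, 3)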
new = old[['A', 'C', 'D']].copy()
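That line copies a column subset; for the question's row-per-company goal, the analogous select-then-copy with a boolean mask would be (a sketch, assuming the question's `df`):

# Same select-then-copy pattern, but filtering rows instead of columns,
# so later edits don't touch (or warn about) the original frame.
aa_rows = df[df['company_id'] == 'AA'].copy()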
answered 1 hour ago by Coddy (new contributor)