Create new data frames from existing data frame based on unique column values














I have a large data set (4.5 million rows, 35 columns). The columns of interest are company_id (string) and company_score (float). There are approximately 10,000 unique company_ids.

company_id    company_score    date_submitted    company_region
AA            .07              1/1/2017          NW
AB            .08              1/2/2017          NE
CD            .0003            1/18/2017         NW


My goal is to create approximately 10,000 new dataframes, one per unique company_id, each containing only that company's rows.



The first idea I had was to create the collection of empty data frames shown below, then loop through the original data set and append the rows that match each company.



company_dictionary = {}
for company in df['company_id']:
    company_dictionary[company] = pd.DataFrame()


Is there a better way to do this by leveraging pandas? i.e., is there a way I can use a built-in pandas function to create new filtered dataframes with only the relevant rows?



Edit: I tried a new approach, but I'm now encountering an error message that I don't understand.



    [In]  unique_company_id = np.unique(df[['ID_BB_GLOBAL']].values)
    [In]  unique_company_id
    [Out] array(['BBG000B9WMF7', 'BBG000B9XBP9', 'BBG000B9ZG58', ..., 'BBG00FWZQ3R9',
                 'BBG00G4XRQN5', 'BBG00H2MZS56'], dtype=object)
    [In]  for id in unique_company_id:
    [In]      new_df = df[df['id'] == id]
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-50-dce34398f1e1> in <module>()
          1 for id in unique_bank_id:
    ----> 2     new_df = df[df['id'] == id]

    pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)()
    pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)()

    KeyError: 'id'
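As the traceback suggests, 'id' is not a column in the frame, so the filter has to use the actual column name. A minimal sketch of the corrected loop, assuming the column is ID_BB_GLOBAL as in the snippet above (the toy data here is illustrative, not the real data set):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real 4.5M-row data set
df = pd.DataFrame({
    "ID_BB_GLOBAL": ["BBG000B9WMF7", "BBG000B9XBP9", "BBG000B9WMF7"],
    "company_score": [.07, .08, .06]})

unique_company_id = np.unique(df[["ID_BB_GLOBAL"]].values)

frames = {}
for company in unique_company_id:
    # Filter on the actual column name -- the literal 'id' does not
    # exist, which is what the KeyError above is complaining about
    frames[company] = df[df["ID_BB_GLOBAL"] == company]
```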









  • Group by company_id then iterate over the results. Welcome to the site! – Emre, Apr 2 '18 at 20:32

  • You try to access df['id'] but there is no such column. Did you mean company_id? – Emre, Apr 3 '18 at 16:45
















python pandas dataframe






edited Apr 3 '18 at 16:43 by Aditya · asked Apr 2 '18 at 18:45 by ForsakenPlague








2 Answers



















You can group by the company_id column and convert the result into a dictionary of DataFrames:



import pandas as pd

df = pd.DataFrame({
    "company_id": ["AA", "AB", "AA", "CD", "AB"],
    "company_score": [.07, .08, .06, .0003, .09],
    "company_region": ["NW", "NE", "NW", "NW", "NE"]})

# Approach 1
dict_of_companies = {k: v for k, v in df.groupby('company_id')}

# Approach 2
dict_of_companies = dict(tuple(df.groupby("company_id")))

import pprint
pprint.pprint(dict_of_companies)


Output:



{'AA':   company_id company_region  company_score
 0               AA             NW           0.07
 2               AA             NW           0.06,
 'AB':   company_id company_region  company_score
 1               AB             NE           0.08
 4               AB             NE           0.09,
 'CD':   company_id company_region  company_score
 3               CD             NW         0.0003}
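Once the dictionary is built, each value is an ordinary DataFrame, so it can be indexed and aggregated like any other. A short usage sketch reusing the toy df above:

```python
import pandas as pd

df = pd.DataFrame({
    "company_id": ["AA", "AB", "AA", "CD", "AB"],
    "company_score": [.07, .08, .06, .0003, .09],
    "company_region": ["NW", "NE", "NW", "NW", "NE"]})

dict_of_companies = {k: v for k, v in df.groupby("company_id")}

# Each value holds only that company's rows
aa = dict_of_companies["AA"]
mean_score = aa["company_score"].mean()
```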





answered Apr 3 '18 at 5:24 by tuomastik

  • Can you please explain why/how Approach 1 works? I've had a lot of difficulty with groupby, mainly because it returns a GroupBy object on which you have to run another aggregating function. But I tried this for a similar problem I had and it worked. Now I'm very curious how/why! Thanks – A. K., Sep 28 '18 at 15:47










  • When you iterate over the groupby object, a tuple of length 2 is returned on each loop. The first item of the tuple corresponds to a unique company_id and the second item corresponds to a DataFrame containing the rows from the original DataFrame which are specific to that unique company_id. – tuomastik, Sep 30 '18 at 10:45
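That iteration pattern can be seen directly; a minimal sketch with a toy frame whose column names mirror the question:

```python
import pandas as pd

df = pd.DataFrame({
    "company_id": ["AA", "AB", "AA"],
    "company_score": [.07, .08, .06]})

# Each iteration yields a (group_key, sub_frame) tuple
sizes = {}
for company_id, group in df.groupby("company_id"):
    sizes[company_id] = len(group)
```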



















new = old[['A', 'C', 'D']].copy()
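Note that this one-liner selects columns rather than rows. Adapted to the question's per-company row filtering, the same idea combines a boolean mask with .copy() (a sketch; the names old and company_id are assumed from the question, not from this answer):

```python
import pandas as pd

old = pd.DataFrame({
    "company_id": ["AA", "AB", "AA"],
    "company_score": [.07, .08, .06]})

# Boolean mask keeps one company's rows; .copy() detaches the
# result so later edits don't warn about modifying a view
new = old[old["company_id"] == "AA"].copy()
```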






answered 1 hour ago by Coddy












