Unbalanced class: class_weight for ML algorithms in Spark MLlib

In Python's sklearn, multiple algorithms (e.g. regression, random forest, etc.) have the class_weight parameter to handle unbalanced data.

However, I cannot find such a parameter for the MLlib algorithms. Is there a plan to implement class_weight for some MLlib algorithms? Or is there any approach in MLlib for unbalanced data? Or do we actually have to handle all the up/downsampling ourselves in MLlib?

Thanks!

asked Dec 7 '16 at 0:08

Edamame

5632617

Yes, the algorithms in Spark's MLlib are prepared to handle complex problems. Additionally, from my understanding there is not a way to perform a stratified split either. Thus, any performance metrics you acquire will not be appropriately represented.
– Samuel Sherman
Jan 6 '17 at 17:43

Here is an example of weighted logistic regression in MLlib from the 2.2 documentation.
– Emre
Oct 3 '17 at 22:14
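
A minimal sketch of how per-row weights for such a weighted model could be derived, assuming the estimator accepts a weight column (in spark.ml, `LogisticRegression` has a `weightCol` parameter). The helper `balanced_class_weights` is hypothetical; it mirrors the heuristic behind sklearn's `class_weight='balanced'` and is written in plain Python so it runs without a cluster.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight} with weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy data (hypothetical): 8 negatives and 2 positives.
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)

# One weight per row; in Spark this would become an extra numeric column
# that is then named in the estimator's weight-column setting.
row_weights = [weights[y] for y in labels]
print(weights)  # {0: 0.625, 1: 2.5}
```

With these weights each class contributes the same total weight to the loss (here 5.0 each), which is the effect class_weight aims for.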

machine-learning apache-spark unbalanced-classes weighted-data

3 Answers

Algorithms in MLlib are always used as a baseline in production scenarios, and they indeed cannot handle some industrial problems, such as label imbalance. So if you want to use them, you have to balance your instances yourself.

Besides, the BSP (bulk synchronous parallel) mechanism in Spark, which you can think of simply as data parallelism, might be the main reason why Spark does not cover that problem. It might be hard for Spark to dispatch instances to all nodes in the cluster in such a way that the partial set of instances on each node shares the same label distribution as the whole.

Finally, if you want to implement it yourself, you only have to weight the loss value of every minority-labeled instance during the iteration process.
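
As a rough illustration of weighting the loss for minority-labeled instances, here is a sketch of a weighted binary log loss in plain Python. The function name, the `minority_weight` value, and the assumption that labels are 0/1 with 1 as the minority class are all hypothetical choices for the example, not part of any MLlib API.

```python
import math


def weighted_log_loss(labels, probs, minority_label=1, minority_weight=4.0):
    """Weighted mean binary cross-entropy for 0/1 labels.

    Rows whose label equals `minority_label` count `minority_weight` times;
    all other rows count once.
    """
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        w = minority_weight if y == minority_label else 1.0
        p = min(max(p, 1e-12), 1 - 1e-12)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += w * loss
        weight_sum += w
    return total / weight_sum


# Toy case (hypothetical): a model that always predicts "mostly negative".
labels = [0, 0, 0, 0, 1]
probs = [0.1, 0.1, 0.1, 0.1, 0.1]

unweighted = weighted_log_loss(labels, probs, minority_weight=1.0)
weighted = weighted_log_loss(labels, probs, minority_weight=4.0)
print(unweighted, weighted)
```

The weighted loss penalizes the missed minority row more heavily, so an optimizer minimizing it has more incentive to get the minority class right.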

Hope this helps, good luck :-)

answered Dec 7 '16 at 3:39

joe

327111

One of the ways I've handled imbalanced classes in the past has been to build a classifier on a dataset sampled to have a 50/50 class split. This means using all of the data points associated with your minority class, and randomly sampling the same number of data points from your majority class.

Whether this will work depends on how much data you actually have in your minority class -- if you have extreme class imbalance (<5% minority class instances), then you may want to consider synthetic oversampling.

You could probably look at pydf.rdd.takeSample() in Spark, or df.sample in pandas.
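
The 50/50 split described above can be sketched in plain Python as follows. The helper `downsample_to_balance` and the toy rows are hypothetical; on a Spark DataFrame or RDD the same idea would be expressed with a sampling call rather than with in-memory lists.

```python
import random


def downsample_to_balance(rows, label_of, seed=42):
    """Keep all minority rows plus an equal-sized random sample of majority rows.

    Assumes exactly two classes; `label_of` extracts the label from a row.
    """
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    if len(by_label) != 2:
        raise ValueError("expected exactly two classes")

    # The shorter list is the minority class, the longer one the majority.
    minority, majority = sorted(by_label.values(), key=len)

    rng = random.Random(seed)
    sampled_majority = rng.sample(majority, len(minority))  # without replacement

    balanced = minority + sampled_majority
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 95 rows of class 0 and 5 rows of class 1.
rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]
balanced = downsample_to_balance(rows, label_of=lambda r: r[1])
print(len(balanced))  # 10
```

Note that this discards most of the majority class, which is why the amount of minority data matters so much for this approach.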

edited Aug 21 '18 at 19:36

Stephen Rauch♦

1,52551330

answered Aug 21 '18 at 15:58

ngopal

413

I have handled class imbalance with the following methods:
1. Merging the classes that appear least frequently into other classes. Obviously, you should use some kind of domain knowledge instead of merging them randomly.
2. Using resampling techniques like oversampling, undersampling, SMOTE, or ADASYN. I don't recommend these techniques because the resampled data doesn't actually represent the real data. But in any case you can certainly take a look at them.
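
A small sketch of the first method, merging rare classes, assuming a domain expert supplies the mapping. The labels, the `merge_map`, and the `min_count` threshold are made up for illustration.

```python
from collections import Counter


def merge_rare_classes(labels, merge_map, min_count):
    """Relabel classes seen fewer than `min_count` times using `merge_map`.

    Rare classes without an entry in `merge_map` are left unchanged.
    """
    counts = Counter(labels)
    return [merge_map.get(y, y) if counts[y] < min_count else y
            for y in labels]


# Toy labels (hypothetical): two frequent classes and two rare ones.
labels = ["cat"] * 50 + ["dog"] * 45 + ["lynx"] * 3 + ["wolf"] * 2

# Hypothetical domain-driven mapping: each rare class goes to its closest
# frequent class rather than to a random one.
merge_map = {"lynx": "cat", "wolf": "dog"}

merged = merge_rare_classes(labels, merge_map, min_count=10)
print(Counter(merged))
```

After merging, only the two frequent classes remain, with 53 and 47 instances respectively.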

answered Sep 21 '18 at 1:37

Siddhi Kiran Bajracharya

3447
