Compound words in NER #156
Hello, in this case it is because of this function: This function extracts the word positions based on whether each character is alphanumeric or not. So perhaps a per-language strategy could be implemented, as is already done in the tokenizers for each language. One question: can you explain what result you expect in each case, "spider man", "spiderman" and "spider-man"? Also, "spiderman" returns "spider" because the threshold provided is very low; with a threshold > 0.5 it will return an empty array.
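To illustrate the behaviour described above, here is a minimal sketch of splitting text into word positions on alphanumeric boundaries. It is not the actual nlp.js function; the name `getWordPositions` and the shape of the returned objects are assumptions made for illustration only.

```javascript
// Hypothetical helper, NOT the real nlp.js implementation.
// It collects every maximal run of letters/digits together with its
// start and end (inclusive) character offsets in the input text.
function getWordPositions(text) {
  const positions = [];
  // \p{L} = any Unicode letter, \p{N} = any Unicode digit, so Finnish
  // characters such as "ä" and "ö" count as part of a word.
  const wordPattern = /[\p{L}\p{N}]+/gu;
  for (const match of text.matchAll(wordPattern)) {
    positions.push({
      word: match[0],
      start: match.index,
      end: match.index + match[0].length - 1,
    });
  }
  return positions;
}

// "spider-man" and "spider man" both split into two candidate words,
// while "spiderman" stays a single candidate.
console.log(getWordPositions("spider-man").map((p) => p.word));
console.log(getWordPositions("spiderman").map((p) => p.word));
```

With a splitter like this, "spider" and "man" are only ever seen as separate candidates when a non-alphanumeric character sits between them, which is why a closed compound such as "spiderman" or "savuporosalaatti" is compared against the entity list as one whole word.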
Hi, I would expect it to return both spider and man. Then you would have a better chance of detecting whatever new supermen get invented. However, that is not as practical an example as the ones you find in Finnish: kana-caesarsalaatti = chicken caesar salad. I could try to list all salads or salad dressings, or just agree that if a compound word ends with salad then it's a salad, and if it ends with salad dressing then it's a salad dressing. Then, if I also have a list of different ingredients, I could tell what kind of salad or salad dressing it is. However, that's not possible now.
returns []
returns
But I still don't know what kind of salad it is. I could add savuporosalaatti as a named entity, but that's an endless path to take. Just think: the same work has to be done for every kind of dish: bread, porridge, soup, stew, omelette, sushi, burrito... It would be easier to be prepared for any kind of dish than to tell people what kinds of dish they can have. This gets even trickier because savuporo is smoked reindeer, but savu alone is smoke and savustettu is smoked. And almost every meat can be smoked. savustettu lohi ja vihannessalaatti would be a salad with smoked salmon and vegetables. savustettu lohi-vihannessalaatti would mean that the whole salmon and vegetable salad is smoked, which is unusual, but I won't stop you from doing that either. Although the grammar rules say that the correct form is valkosipulisalaattikastike, you can sometimes see it written as valkosipuli salaattikastike. Therefore, it would be best to have a possibility to find entities with form
Compound words don't work optimally with NER in nlp.js. For example, according to nlp.js, Spiderman is a spider but not a man. Whitespace seems to carry excessive significance: according to nlp.js, Spider-Man or Spider Man is certainly both a spider and a man. I don't see the point of this distinction. In Finnish especially this is a critical issue, because we have a huge number of compound words. It would be nice if nlp.js could be configured to behave differently in this case.
spiderman returns
spider man and spider-man both return