
[R-Forge #5159] Matching on Non-ASCII characters #69

Submitted by: Stefan Fritsch; Assigned to: Arun; R-Forge link

Hi,

I couldn't find a bug report for the general problem of matching character vectors with different encodings, so I thought I might open one.

Technically this doesn't have to be a bug, since the vectors being compared are stored in different encodings. But encoding is otherwise handled transparently in R, there is no indication of the problem to the user whatsoever, and it often leads to massive and almost unnoticeable errors.

IMHO there should at least be a warning. The code for reproduction is below.

Thanks for your time. =)
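To make the mismatch visible, here is a minimal base-R sketch (no data.table needed). The names `x_utf8` and `x_latin1` are made up for illustration, and the strings are written with `\u` escapes so the result should not depend on the encoding of the source file. It only shows that `==` reports equality while the declared encodings and the stored bytes differ.

```r
# Same text in two encodings; \u escapes yield strings marked as UTF-8.
x_utf8   <- c("a", "\u00e4", "\u00df", "z")
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# Base R translates before comparing, so the vectors look identical.
print(x_utf8 == x_latin1)

# The declared encodings differ for the non-ASCII elements
# (pure ASCII elements are reported as "unknown").
print(Encoding(x_utf8))
print(Encoding(x_latin1))

# The stored bytes differ as well: two bytes in UTF-8, one in latin1.
print(charToRaw(x_utf8[2]))
print(charToRaw(x_latin1[2]))
```

Anything that compares the stored strings directly instead of translating them first would treat the second and third elements as different, which would be consistent with the NA results in the output that follows.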

dt<-data.table(a,b=1:4,key="a")
df<-data.frame(a,b=1:4)
rownames(df)<-df$a

a==au
[1] TRUE TRUE TRUE TRUE

df[au,]
  a b
a a 1
ä ä 2
ß ß 3
z z 4
dt[au]
   a  b
1: a  1
2: ä NA
3: ß NA
4: z  4

merge(df,data.frame(a=au),by="a")
  a b
1 a 1
2 ä 2
3 ß 3
4 z 4
merge(dt,data.table(a=au),by="a")
   a b
1: a 1
2: z 4

match(a,au)
[1] 1 2 3 4
chmatch(a,au)
[1]  1 NA NA  4

Code for reproduction

Repository/R-Forge/Revision: 1046

library(data.table)

a<-c("a","ä","ß","z")

# In my case the encoding is latin1 and I change au to UTF-8;
# if you're on Linux you probably need to do it the other way around.
Encoding(a)
au<-iconv(a,"latin1","UTF-8")
# au<-iconv(a,"UTF-8","latin1")

dt<-data.table(a,b=1:4)
df<-data.frame(a,b=1:4)
rownames(df)<-df$a

a==au

df[au,]
setkey(dt,a)
dt[au]

merge(df,data.frame(a=au),by="a")
merge(dt,data.table(a=au),by="a")

match(a,au)
chmatch(a,au)
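Until something like this is handled inside the package, a possible workaround is to bring both vectors to one encoding before matching. The sketch below is only an illustration under stated assumptions: `normalise_encoding()` is a hypothetical helper, not part of data.table or base R, and it assumes that converting both sides with `enc2utf8()` is acceptable for the data at hand. It also emits the kind of warning asked for above.

```r
# Hypothetical helper (not a data.table or base R function): warn when two
# character vectors carry different declared encodings, then return both
# converted to UTF-8. Pure ASCII elements are reported as "unknown" and ignored,
# so an all-ASCII vector paired with a non-ASCII one also triggers the warning.
normalise_encoding <- function(x, y) {
  enc_x <- setdiff(unique(Encoding(x)), "unknown")
  enc_y <- setdiff(unique(Encoding(y)), "unknown")
  if (!setequal(enc_x, enc_y)) {
    warning("character vectors have different encodings; converting both to UTF-8")
  }
  list(x = enc2utf8(x), y = enc2utf8(y))
}

# Illustrative inputs: the same text once in UTF-8 and once in latin1.
key_utf8      <- c("a", "\u00e4", "\u00df", "z")
lookup_latin1 <- iconv(key_utf8, from = "UTF-8", to = "latin1")

fixed <- suppressWarnings(normalise_encoding(key_utf8, lookup_latin1))

# After normalisation the stored bytes agree element by element.
same_bytes <- identical(lapply(fixed$x, charToRaw), lapply(fixed$y, charToRaw))
print(same_bytes)
```

With data.table one would then use the normalised vectors as the key column and the lookup values. I have not verified this against revision 1046, so treat the idea that `chmatch()` and keyed joins succeed once the bytes agree as an assumption.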
