Submitted by: Stefan Fritsch; Assigned to: Arun ; R-Forge link
Hi,
I couldn't find a bug report for the general problem of matching character vectors with different encodings, so I thought I might open one.
Technically this doesn't have to be a bug (you're comparing different vectors), but R otherwise handles encoding transparently, the user gets no indication of the problem whatsoever, and it often leads to massive, almost unnoticeable errors.
IMHO there should be at least a warning. The code for reproduction is below.
Thanks for your time. =)
> dt<-data.table(a,b=1:4,key="a")
> df<-data.frame(a,b=1:4)
> rownames(df)<-df$a
> a==au
[1] TRUE TRUE TRUE TRUE
> df[au,]
  a b
a a 1
ä ä 2
ß ß 3
z z 4
> dt[au]
   a  b
1: a  1
2: ä NA
3: ß NA
4: z  4
> merge(df,data.frame(a=au),by="a")
  a b
1 a 1
2 ä 2
3 ß 3
4 z 4
> merge(dt,data.table(a=au),by="a")
   a b
1: a 1
2: z 4
> match(a,au)
[1] 1 2 3 4
> chmatch(a,au)
[1]  1 NA NA  4
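The transcript above shows a==au returning TRUE while chmatch(a,au) returns NA for the non-ASCII elements. A minimal base-R sketch of how that can happen: the two vectors hold the same characters but different bytes and encoding marks. The names x_latin1 and x_utf8 are illustrative and not from the report; the strings are built with \u escapes so the example does not depend on the encoding of the source file.

```r
# Same four strings as in the report, built so that the marks are explicit.
x_utf8   <- c("a", "\u00e4", "\u00df", "z")    # non-ASCII elements marked UTF-8
x_latin1 <- iconv(x_utf8, "UTF-8", "latin1")   # same characters, marked latin1

# ASCII-only elements are never marked, so they report "unknown".
Encoding(x_utf8)
Encoding(x_latin1)

# `==` translates before comparing, so the vectors look equal ...
x_latin1 == x_utf8

# ... but the underlying bytes differ for the non-ASCII elements.
charToRaw(x_latin1[2])   # one byte in latin1
charToRaw(x_utf8[2])     # two bytes in UTF-8
```

Any matching routine that compares the stored strings without translating them first would treat x_latin1[2] and x_utf8[2] as different. Whether that is what chmatch() does internally is an assumption here, not something the report confirms.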
Code for reproduction
Repository/R-Forge/Revision: 1046
library(data.table)
a<-c("a","ä","ß","z")
# In my case the Encoding is latin1 and
# I change au to UTF-8;
# if you're on Linux you probably need to
# do it the other way around.
Encoding(a)
au<-iconv(a,"latin1","UTF-8")
# au<-iconv(a,"UTF-8","latin1")
dt<-data.table(a,b=1:4)
df<-data.frame(a,b=1:4)
rownames(df)<-df$a
a==au
df[au,]
setkey(dt,a)
dt[au]
merge(df,data.frame(a=au),by="a")
merge(dt,data.table(a=au),by="a")
match(a,au)
chmatch(a,au)
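As a sketch of the kind of warning suggested above, and of a possible workaround until something like it exists, the following base-R snippet checks two character vectors for differing encoding marks and then normalises both with enc2utf8(). warn_if_mixed_encodings() is a hypothetical helper, not a data.table function, and it only looks at declared marks, so unmarked native-encoded strings are not detected. That normalising before setkey() or chmatch() avoids the NA results is an assumption, not something verified against this revision.

```r
# Hypothetical helper: warn when x and y together carry more than one
# declared (non-"unknown") encoding mark.
warn_if_mixed_encodings <- function(x, y) {
  marks <- setdiff(unique(c(Encoding(x), Encoding(y))), "unknown")
  mixed <- length(marks) > 1L
  if (mixed) {
    warning("character vectors use different encodings: ",
            paste(marks, collapse = ", "))
  }
  invisible(mixed)
}

# Illustrative vectors, built with \u escapes to be independent of file encoding.
x_utf8   <- c("a", "\u00e4", "\u00df", "z")
x_latin1 <- iconv(x_utf8, "UTF-8", "latin1")

warn_if_mixed_encodings(x_latin1, x_utf8)   # emits the warning

# Possible workaround: convert both sides to UTF-8 before keying or joining.
a_norm  <- enc2utf8(x_latin1)
au_norm <- enc2utf8(x_utf8)
warn_if_mixed_encodings(a_norm, au_norm)    # silent: both sides are UTF-8 now
```

After normalisation both vectors store the same bytes, so a_norm and au_norm could be used in place of a and au in the reproduction code.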