Cannot COPY UTF-8 data to ScyllaDB with cqlsh
I'm trying to copy a large data set from PostgreSQL to ScyllaDB, which is supposed to be compatible with Cassandra.
This is what I'm trying:
psql <db_name> -c "COPY (SELECT row_number() OVER () as id, * FROM ds.my_data_set LIMIT 20) TO stdout WITH (FORMAT csv, HEADER, DELIMITER ';');" \
| CQLSH_HOST=172.17.0.3 cqlsh -e 'COPY test.mytable (id, "Ist Einpöster", [....]) FROM STDIN WITH DELIMITER = $$;$$ AND HEADER = TRUE;'
I get an obscure error without a stack trace:
:1:'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)
My data and column names, including the ones already in the created table in ScyllaDB, contain values with German text. It's not ASCII, but I haven't found anywhere to set the encoding, and everywhere I looked it seemed to be using utf-8 already. I also looked around line 1135 of cqlsh and changed the encoding there in my local copy (using vim $(which cqlsh)), but it had no effect.
I'm using cqlsh 5.0.1, installed using pip. (Weirdly, it was pip install cqlsh==5.0.4.)
I also tried the cqlsh from the Docker image that I used to install ScyllaDB, and it produces the exact same error.
<Update>
As suggested, I piped the data to a file:
psql <db_name> -c "COPY (SELECT row_number() OVER (), * FROM ds.my_data_set ds) TO stdout WITH (FORMAT csv, HEADER);" | head -n 1 > test.csv
I thinned it down to the first row (the CSV header). Piping it to cqlsh made it cry with the same error. Then, using the python3.5 interactive shell, I did this:
>>> with open('test.csv', 'rb') as fp:
... data = fp.read()
>>> data
b'row_number,..... Ist Einp\xc3\xb6ster ........'
So there we are, \xc3 in the flesh. Is it UTF-8?
>>> data.decode('utf-8')
'row_number,....... Ist Einpöster ........'
Yes, it's UTF-8. So how does the error happen?
>>> data.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 336: ordinal not in range(128)
Same error text, so it's probably Python as well, but without a stack trace I have no idea where this is happening, and the default encodings are utf-8. I tried overriding the default with utf-8, but nothing changed. Still, somewhere, something is trying to decode a stream as ASCII.
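To see the failure mode in isolation, here is a minimal sketch of what I believe is going on (the header text below is a stand-in for my real column list, not the actual data):

```python
# Minimal reproduction: UTF-8 bytes decoded as ASCII.
# 'ö' encodes to two bytes in UTF-8, and the first one, 0xc3, is outside
# the ASCII range (0-127), so an ASCII decoder chokes on it.
header = 'id;Ist Einpöster'
raw = header.encode('utf-8')

assert b'\xc3\xb6' in raw             # the UTF-8 encoding of 'ö'
assert raw.decode('utf-8') == header  # round-trips cleanly as UTF-8

try:
    raw.decode('ascii')
    raised = False
except UnicodeDecodeError as exc:
    raised = True
    # The same error text cqlsh printed, minus the swallowed stack trace.
    assert "'ascii' codec can't decode byte 0xc3" in str(exc)
assert raised
```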
This is the locale on the server/client:
LANG=
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
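For what it's worth, the locale is not what drives Python's default str codec; a quick sanity check (standard library only, nothing cqlsh-specific):

```python
import locale
import sys

# Python 3 hard-wires the default str codec to UTF-8, regardless of locale.
# Python 2, which cqlsh of that era ran on, defaults to 'ascii' instead,
# which matches the codec named in the error above.
assert sys.getdefaultencoding() == 'utf-8'   # true on Python 3

# What *does* follow LC_CTYPE / LC_ALL is the preferred locale encoding,
# used for text-mode files and typically for piped stdin/stdout.
print(locale.getpreferredencoding())
```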
Someone on Slack suggested this answer: UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128). Once I added the last 2 lines from it at the beginning of cqlsh.py, it got past the decoding issue, but the same column was reported as invalid with another error:
:1:Invalid column name Ist Einpöster
Side note:
I lost interest in this test at that point, and I'm just trying not to leave an unanswered question, so please excuse the wait time. As I was trying it out as an analytical engine, coupled with Spark, as a data source for Tableau, I found "better" alternatives, like Vertica and ClickHouse. "Better" in quotes, because both of them have their own limitations.
</Update>
How can I complete this import?
unicode stdin cqlsh scylla
1. Can you share the schema from Postgresql (inc. the data types) vs. the schema you created on Scylla / Cassandra (inc. the data types). Sometimes some alterations are required. 2. Do you have any NULL values in your data?
– TomerSan
Nov 28 '18 at 10:35
@TomerSan apparently that doesn't matter. I managed to replicate the error now with something as simple as this:
echo "not important what you write here" | cqlsh --encoding='utf-8' -e 'COPY test.whatever (id, "ö") FROM STDIN WITH DELIMITER = $$;$$ AND HEADER = TRUE;'
Only the keyspace needs to exist. The rest fails before any structure validation. This is obviously the query parser, since the position of the invalid character is the position of the ö
in the column name.
– AlexanderMP
Nov 29 '18 at 13:07
asked Nov 22 '18 at 16:38, edited Nov 29 '18 at 13:21 by AlexanderMP
3 Answers
What was it?
The query passed in as an argument contained the column list, which included the column with the non-ASCII character. At some point, cqlsh parsed those as ascii and not utf-8, which led to this error.
How was it fixed?
The first attempt was to add these 2 lines at the beginning of cqlsh:
reload(sys)
sys.setdefaultencoding('utf-8')
but that still left the script unable to work with that column.
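A side note on why that hack is fragile (my own observation, not from the original discussion): sys.setdefaultencoding only exists in Python 2, and only after reload(sys) resurrects it; Python 3 removed the hook entirely:

```python
import sys

# The reload(sys); sys.setdefaultencoding('utf-8') trick is Python 2 only:
# there, site.py deletes the function at interpreter startup, and
# reload(sys) brings it back. Python 3 removed the hook altogether and
# always uses UTF-8 as the default str codec.
assert not hasattr(sys, 'setdefaultencoding')   # true on Python 3
assert sys.getdefaultencoding() == 'utf-8'
```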
Second attempt was to simply pass the query from a file. If you can't, know that bash supports process substitution, so instead of this:
cqlsh -f path/to/query.cql
you can have
cqlsh -f <(echo "COPY .... FROM STDIN;")
And it's all great, except that it doesn't work either. cqlsh treats stdin as "interactive", i.e. coming from a prompt, not piped in. The result is that it doesn't import anything. One could just create a file and load it from there, but that's an extra step that might take minutes or hours, depending on the data size.
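The interactive-vs-piped distinction is presumably made the usual way; a sketch of the check (the function name here is mine, not cqlsh's):

```python
import sys

def input_mode() -> str:
    """Report whether stdin is a terminal (interactive) or a pipe/file."""
    # isatty() is True only when stdin is attached to a real terminal;
    # a REPL-style tool like cqlsh can use this to decide whether to show
    # a prompt or to consume piped input.
    return 'interactive' if sys.stdin.isatty() else 'piped'

print(input_mode())
```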
Thankfully, POSIX systems have virtual files like /dev/stdin, so the above command is equivalent to this:
cqlsh -f <(echo "COPY .... FROM '/dev/stdin';")
except that cqlsh now thinks that you actually have a file, and it reads it like a file, so you can pipe your data and be happy.
This would probably work, but for some reason I got one last kick:
cqlsh.sql:2:Failed to import 15 rows: InvalidRequest - Error from server: code=2200 [Invalid query] message="Batch too large", will retry later, attempt 4 of 5
I think it's funny that 15 rows are too much for a distributed storage engine. It's likely again some limitation in the engine related to unicode, and just a wrong error message. Or I'm wrong. Nevertheless, the initial question was answered, with some BIG help from the guys in Slack.
answered Nov 29 '18 at 16:11 by AlexanderMP
The 15 rows failed because each of them contained a too-large batch; the output is a bit confusing
– dyasny
Nov 29 '18 at 20:32
I don't see that you ever got an answer to this. UTF-8 should be the default.
Did you try --encoding?
Docs: https://docs.scylladb.com/getting-started/cqlsh/
If you didn't get an answer here, would you like to ask it on our Slack channel?
answered Nov 27 '18 at 18:35 by Peter Corless
I also did some searching on that error code, 0xc3. That reminded me of something. I wondered if you may be suffering from some data somewhere that's not actually UTF-8, but may actually be Unicode. Something's happening in the encoding/decoding. I read this thread about people running into the same error, though obviously for different reasons. But the answers may help you think through where the error may be. stackoverflow.com/questions/24475393/…
– Peter Corless
Nov 27 '18 at 19:46
I'll try the Slack channel, just out of curiosity. I think this isn't a UTF-8 problem, since the error says an ASCII codec is being tried. Python can decode the stream with UTF-8 with no problems.
– AlexanderMP
Nov 29 '18 at 12:12
I would try to eliminate all the extra complexity you have in there first. Try to dump a few rows into a CSV, and then load it into Scylla using COPY.
Thanks for the suggestion. I doubt that psql has anything to do with it, as we use this method, piping to and from hundreds of sources, and have never had this issue. But I tried it nevertheless, and updated the question.
– AlexanderMP
Nov 29 '18 at 12:10
OK, looks like a Python issue to me. Have you tried with a newer cqlsh? I suppose you can get the one that ships with Cassandra, or just install the latest one via pip
Nov 29 '18 at 16:10
I got around that bug eventually. Details in my answer. Thank you for your time and effort.
– AlexanderMP
Nov 29 '18 at 16:13
What was it?
The query passed in as an argument, contained the column list, which contained that column with a non-ASCII character. At some point, cqlsh
parsed those as ascii and not utf-8
, which lead to this error.
How it was fixed?
First attempt was to add these 2 lines in cqlsh
:
reload(sys)
sys.setdefaultencoding('utf-8')
but that still made the script unable to work with that column.
Second attempt was to simply pass the query from a file. If you can't, know that bash supports process substitution, so instead of this:
cqlsh -f path/to/query.cql
you can have
cqlsh -f <(echo "COPY .... FROM STDIN;")
And it's all great, except that it doesn't work either. cqlsh
understands stdin
as "interactive", from a prompt, and not piped in. The result is that it doesn't import anything. One could just create a file, and load it from the file, but that's an extra step that might take minutes or hours, depending on the data size.
Thankfully, POSIX systems have these virtual files like '/dev/stdin', so the above command is equivalent to this:
cqlsh -f <(echo "COPY .... FROM '/dev/stdin';")
except that cqlsh now thinks that you actually have a file, and it reads it like a file, so you can pipe your data and be happy.
This would probably work, but for some reason I got the last kick:
cqlsh.sql:2:Failed to import 15 rows: InvalidRequest - Error from server: code=2200 [Invalid query] message="Batch too large", will retry later, attempt 4 of 5
I think it's funny that 15 rows are too much for a distributed storage engine. And it's likely that it's again some limitation from the engine related to unicode and just a wrong error message. Or I'm wrong. Nevertheless, the initial question was answered, with some BIG help from the guys in Slack.
15 rows failed because inside each of them, you had a too large batch, the output is a bit confusing
– dyasny
Nov 29 '18 at 20:32
add a comment |
What was it?
The query passed in as an argument, contained the column list, which contained that column with a non-ASCII character. At some point, cqlsh
parsed those as ascii and not utf-8
, which lead to this error.
How it was fixed?
First attempt was to add these 2 lines in cqlsh
:
reload(sys)
sys.setdefaultencoding('utf-8')
but that still made the script unable to work with that column.
Second attempt was to simply pass the query from a file. If you can't, know that bash supports process substitution, so instead of this:
cqlsh -f path/to/query.cql
you can have
cqlsh -f <(echo "COPY .... FROM STDIN;")
And it's all great, except that it doesn't work either. cqlsh
understands stdin
as "interactive", from a prompt, and not piped in. The result is that it doesn't import anything. One could just create a file, and load it from the file, but that's an extra step that might take minutes or hours, depending on the data size.
Thankfully, POSIX systems have these virtual files like '/dev/stdin', so the above command is equivalent to this:
cqlsh -f <(echo "COPY .... FROM '/dev/stdin';")
except that cqlsh now thinks that you actually have a file, and it reads it like a file, so you can pipe your data and be happy.
This would probably work, but for some reason I got the last kick:
cqlsh.sql:2:Failed to import 15 rows: InvalidRequest - Error from server: code=2200 [Invalid query] message="Batch too large", will retry later, attempt 4 of 5
I think it's funny that 15 rows are too much for a distributed storage engine. And it's likely that it's again some limitation from the engine related to unicode and just a wrong error message. Or I'm wrong. Nevertheless, the initial question was answered, with some BIG help from the guys in Slack.
15 rows failed because inside each of them, you had a too large batch, the output is a bit confusing
– dyasny
Nov 29 '18 at 20:32
add a comment |
What was it?
The query passed in as an argument, contained the column list, which contained that column with a non-ASCII character. At some point, cqlsh
parsed those as ascii and not utf-8
, which lead to this error.
How it was fixed?
First attempt was to add these 2 lines in cqlsh
:
reload(sys)
sys.setdefaultencoding('utf-8')
but that still made the script unable to work with that column.
Second attempt was to simply pass the query from a file. If you can't, know that bash supports process substitution, so instead of this:
cqlsh -f path/to/query.cql
you can have
cqlsh -f <(echo "COPY .... FROM STDIN;")
And it's all great, except that it doesn't work either. cqlsh
understands stdin
as "interactive", from a prompt, and not piped in. The result is that it doesn't import anything. One could just create a file, and load it from the file, but that's an extra step that might take minutes or hours, depending on the data size.
Thankfully, POSIX systems have these virtual files like '/dev/stdin', so the above command is equivalent to this:
cqlsh -f <(echo "COPY .... FROM '/dev/stdin';")
except that cqlsh now thinks that you actually have a file, and it reads it like a file, so you can pipe your data and be happy.
This would probably work, but for some reason I got the last kick:
cqlsh.sql:2:Failed to import 15 rows: InvalidRequest - Error from server: code=2200 [Invalid query] message="Batch too large", will retry later, attempt 4 of 5
I think it's funny that 15 rows are too much for a distributed storage engine. And it's likely that it's again some limitation from the engine related to unicode and just a wrong error message. Or I'm wrong. Nevertheless, the initial question was answered, with some BIG help from the guys in Slack.
What was it?
The query passed in as an argument, contained the column list, which contained that column with a non-ASCII character. At some point, cqlsh
parsed those as ascii and not utf-8
, which lead to this error.
How it was fixed?
First attempt was to add these 2 lines in cqlsh
:
reload(sys)
sys.setdefaultencoding('utf-8')
but that still made the script unable to work with that column.
Second attempt was to simply pass the query from a file. If you can't, know that bash supports process substitution, so instead of this:
cqlsh -f path/to/query.cql
you can have
cqlsh -f <(echo "COPY .... FROM STDIN;")
And it's all great, except that it doesn't work either. cqlsh
understands stdin
as "interactive", from a prompt, and not piped in. The result is that it doesn't import anything. One could just create a file, and load it from the file, but that's an extra step that might take minutes or hours, depending on the data size.
Thankfully, POSIX systems have these virtual files like '/dev/stdin', so the above command is equivalent to this:
cqlsh -f <(echo "COPY .... FROM '/dev/stdin';")
except that cqlsh now thinks that you actually have a file, and it reads it like a file, so you can pipe your data and be happy.
This would probably work, but for some reason I got the last kick:
cqlsh.sql:2:Failed to import 15 rows: InvalidRequest - Error from server: code=2200 [Invalid query] message="Batch too large", will retry later, attempt 4 of 5
I think it's funny that 15 rows are too much for a distributed storage engine. And it's likely that it's again some limitation from the engine related to unicode and just a wrong error message. Or I'm wrong. Nevertheless, the initial question was answered, with some BIG help from the guys in Slack.
answered Nov 29 '18 at 16:11
AlexanderMPAlexanderMP
10.8k43252
10.8k43252
15 rows failed because inside each of them, you had a too large batch, the output is a bit confusing
– dyasny
Nov 29 '18 at 20:32
add a comment |
15 rows failed because inside each of them, you had a too large batch, the output is a bit confusing
– dyasny
Nov 29 '18 at 20:32
15 rows failed because inside each of them, you had a too large batch, the output is a bit confusing
– dyasny
Nov 29 '18 at 20:32
15 rows failed because inside each of them, you had a too large batch, the output is a bit confusing
– dyasny
Nov 29 '18 at 20:32
add a comment |
I don't see that you ever got an answer to this. UTF-8 should be the default.
Did you try --encoding
?
Docs: https://docs.scylladb.com/getting-started/cqlsh/
If you didn't get an answer here, would you wish to ask it on our slack channel?
I also did some searching on that error code, 0xc3. That reminded me of something. I wondered if you may be suffering from some data somewhere that's not actually UTF-8, but may actually be Unicode. Something's happening in the encoding/decoding. I read this thread about people running into the same error, though obviously for different reasons. But the answers may help you think through where the error may be. stackoverflow.com/questions/24475393/…
– Peter Corless
Nov 27 '18 at 19:46
I'll try the Slack channel, just out of curiosity. I think this isn't an UTF-8 problem, since it's trying out an ASCII codec according to the error. Python can decode the stream with no problems with UTF-8.
– AlexanderMP
Nov 29 '18 at 12:12
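That point can be reproduced without cqlsh at all; the exception cqlsh swallows is exactly Python's. A minimal sketch using the column name from the question:

```python
# Bytes that are valid UTF-8 but fail under the 'ascii' codec -- the
# same mismatch behind cqlsh's "'ascii' codec can't decode byte 0xc3".
data = 'Ist Einpöster'.encode('utf-8')

assert data.decode('utf-8') == 'Ist Einpöster'  # round-trips fine as UTF-8

try:
    data.decode('ascii')
except UnicodeDecodeError as e:
    # 0xc3 is the lead byte of the two-byte UTF-8 sequence for 'ö'
    print(e.encoding, hex(data[e.start]), e.start)  # ascii 0xc3 8
```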
answered Nov 27 '18 at 18:35
Peter Corless
I would try to eliminate all the extra complexity you have in there first. Try to dump a few rows into a CSV, and then load it into Scylla using COPY
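That two-step suggestion might look like this. A sketch only: the file name, table, and row limit are placeholders, and the psql command mirrors the one in the question:

```shell
# Step 1: dump a few rows from Postgres to a plain CSV file.
psql <db_name> -c "COPY (SELECT * FROM ds.my_data_set LIMIT 20) TO stdout WITH (FORMAT csv, HEADER);" > sample.csv

# Step 2: load that file into Scylla with cqlsh's COPY ... FROM.
cqlsh -e "COPY test.mytable FROM 'sample.csv' WITH HEADER = TRUE;"
```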
Thanks for the suggestion. I doubt that psql has anything to do with it: we use this method, piping to and from hundreds of sources, and have never had this issue. But I tried it nevertheless, and updated the question.
– AlexanderMP
Nov 29 '18 at 12:10
OK, looks like a python issue to me. Have you tried with a newer cqlsh? I suppose you can get the one that ships with Cassandra, or just install the latest one via pip
– dyasny
Nov 29 '18 at 16:10
I got around that bug eventually. Details in my answer. Thank you for your time and effort.
– AlexanderMP
Nov 29 '18 at 16:13
answered Nov 27 '18 at 19:21
dyasny
1. Can you share the schema from Postgresql (inc. the data types) vs. the schema you created on Scylla / Cassandra (inc. the data types). Sometimes some alterations are required. 2. Do you have any NULL values in your data?
– TomerSan
Nov 28 '18 at 10:35
@TomerSan apparently that doesn't matter. I managed to replicate the error with something as simple as this:
echo "not important what you write here" | cqlsh --encoding='utf-8' -e 'COPY test.whatever (id, "ö") FROM STDIN WITH DELIMITER = $$;$$ AND HEADER = TRUE;'
Only the keyspace needs to exist; the rest fails before any structure validation. This is obviously the query parser, since the position of the invalid character is the position of the ö in the column name.
– AlexanderMP
Nov 29 '18 at 13:07
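A quick way to check that claim is to compute where 0xc3 lands in the statement's bytes. A sketch, with the column list mirroring the comment's minimal example; if the statement is being decoded with the ascii codec, the reported error position should equal the byte offset of the ö:

```python
# Locate the first non-ASCII byte in the raw CQL statement.
stmt = 'COPY test.whatever (id, "ö") FROM STDIN'.encode('utf-8')

try:
    stmt.decode('ascii')
except UnicodeDecodeError as e:
    # e.start is the byte offset of b'\xc3\xb6', i.e. of the 'ö'
    print(e.start, stmt[e.start:e.start + 2])
```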