Splitting pandas dataframe column (into two) after the first letter in the cell
up vote
3
down vote
favorite
The problem
I would like to split a column from a pandas dataframe into 2 columns, in the percentage column (see below), each entry starts with a capitalised alphabet character, I would like to split the 'Percentage' column immediately after this letter, with the new column labelled 'Amino Acid'.
Current Code:
import pandas as pd
df = pd.read_csv('foo.csv')
df['Amino Acid'], df['Percentage'] = zip(*df['Percentage'].map(lambda x: x.split('[^a-zA-Z]')))
df.to_csv('bar.csv',index=False)
Example of input data
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
| Species | ID | OGT | DB | Percentage |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | E is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | R is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | A is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
Example of desired output
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
| Species | ID | OGT | DB | Amino Acid | Percentage |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | E | is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | R | is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | A | is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
python pandas
add a comment |
up vote
3
down vote
favorite
The problem
I would like to split a column from a pandas dataframe into 2 columns, in the percentage column (see below), each entry starts with a capitalised alphabet character, I would like to split the 'Percentage' column immediately after this letter, with the new column labelled 'Amino Acid'.
Current Code:
import pandas as pd
df = pd.read_csv('foo.csv')
df['Amino Acid'], df['Percentage'] = zip(*df['Percentage'].map(lambda x: x.split('[^a-zA-Z]')))
df.to_csv('bar.csv',index=False)
Example of input data
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
| Species | ID | OGT | DB | Percentage |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | E is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | R is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | A is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
Example of desired output
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
| Species | ID | OGT | DB | Amino Acid | Percentage |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | E | is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | R | is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | A | is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
python pandas
add a comment |
up vote
3
down vote
favorite
up vote
3
down vote
favorite
The problem
I would like to split a column from a pandas dataframe into 2 columns, in the percentage column (see below), each entry starts with a capitalised alphabet character, I would like to split the 'Percentage' column immediately after this letter, with the new column labelled 'Amino Acid'.
Current Code:
import pandas as pd
df = pd.read_csv('foo.csv')
df['Amino Acid'], df['Percentage'] = zip(*df['Percentage'].map(lambda x: x.split('[^a-zA-Z]')))
df.to_csv('bar.csv',index=False)
Example of input data
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
| Species | ID | OGT | DB | Percentage |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | E is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | R is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | A is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
Example of desired output
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
| Species | ID | OGT | DB | Amino Acid | Percentage |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | E | is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | R | is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | A | is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
python pandas
The problem
I would like to split a column from a pandas dataframe into 2 columns, in the percentage column (see below), each entry starts with a capitalised alphabet character, I would like to split the 'Percentage' column immediately after this letter, with the new column labelled 'Amino Acid'.
Current Code:
import pandas as pd
df = pd.read_csv('foo.csv')
df['Amino Acid'], df['Percentage'] = zip(*df['Percentage'].map(lambda x: x.split('[^a-zA-Z]')))
df.to_csv('bar.csv',index=False)
Example of input data
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
| Species | ID | OGT | DB | Percentage |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | E is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | R is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | A is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
Example of desired output
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
| Species | ID | OGT | DB | Amino Acid | Percentage |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | E | is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | R | is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | A | is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
python pandas
python pandas
edited Nov 19 at 7:13
Cœur
17.1k9102140
17.1k9102140
asked Jul 9 at 10:44
Biomage
908
908
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
up vote
4
down vote
accepted
Use split
be first whitespace:
df[['Amino Acid', 'Percentage']] = df['Percentage'].str.split(n=1, expand=True)
Do you know how I could also split the string 'archaea' out of the input csv into another column? My input has archaea or bacteria which in that entry
– Biomage
Jul 9 at 12:27
@Biomage - I think needdf[['before_arch', 'after_arch']] = df['Percentage'].str.split('archaea',n=1, expand=True)
– jezrael
Jul 9 at 12:46
Hmm, in my actual input data I also have some 'bacteria' instead of 'archaea' is there a solution that put either result in it's own column?
– Biomage
Jul 9 at 12:48
1
@Biomage - do you think split by wordarchaea
orbacteria
? Then need.str.split('archaea|bacteria',n=1, expand=True)
– jezrael
Jul 9 at 12:49
if I try:df['Domain'] = df['Percentage'].str.split('archaea|bacteria',n=1, expand=True)
I get a wrong number of items passed error
– Biomage
Jul 9 at 12:57
|
show 5 more comments
up vote
2
down vote
You can extract the first letter directly:
df['Amino Acid'] = df['Percentage'].str[0]
df['Percentage'] = df['Percentage'].str[1:]
add a comment |
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
4
down vote
accepted
Use split
be first whitespace:
df[['Amino Acid', 'Percentage']] = df['Percentage'].str.split(n=1, expand=True)
Do you know how I could also split the string 'archaea' out of the input csv into another column? My input has archaea or bacteria which in that entry
– Biomage
Jul 9 at 12:27
@Biomage - I think needdf[['before_arch', 'after_arch']] = df['Percentage'].str.split('archaea',n=1, expand=True)
– jezrael
Jul 9 at 12:46
Hmm, in my actual input data I also have some 'bacteria' instead of 'archaea' is there a solution that put either result in it's own column?
– Biomage
Jul 9 at 12:48
1
@Biomage - do you think split by wordarchaea
orbacteria
? Then need.str.split('archaea|bacteria',n=1, expand=True)
– jezrael
Jul 9 at 12:49
if I try:df['Domain'] = df['Percentage'].str.split('archaea|bacteria',n=1, expand=True)
I get a wrong number of items passed error
– Biomage
Jul 9 at 12:57
|
show 5 more comments
up vote
4
down vote
accepted
Use split
be first whitespace:
df[['Amino Acid', 'Percentage']] = df['Percentage'].str.split(n=1, expand=True)
Do you know how I could also split the string 'archaea' out of the input csv into another column? My input has archaea or bacteria which in that entry
– Biomage
Jul 9 at 12:27
@Biomage - I think needdf[['before_arch', 'after_arch']] = df['Percentage'].str.split('archaea',n=1, expand=True)
– jezrael
Jul 9 at 12:46
Hmm, in my actual input data I also have some 'bacteria' instead of 'archaea' is there a solution that put either result in it's own column?
– Biomage
Jul 9 at 12:48
1
@Biomage - do you think split by wordarchaea
orbacteria
? Then need.str.split('archaea|bacteria',n=1, expand=True)
– jezrael
Jul 9 at 12:49
if I try:df['Domain'] = df['Percentage'].str.split('archaea|bacteria',n=1, expand=True)
I get a wrong number of items passed error
– Biomage
Jul 9 at 12:57
|
show 5 more comments
up vote
4
down vote
accepted
up vote
4
down vote
accepted
Use split
be first whitespace:
df[['Amino Acid', 'Percentage']] = df['Percentage'].str.split(n=1, expand=True)
Use split
be first whitespace:
df[['Amino Acid', 'Percentage']] = df['Percentage'].str.split(n=1, expand=True)
answered Jul 9 at 10:47
jezrael
311k21247322
311k21247322
Do you know how I could also split the string 'archaea' out of the input csv into another column? My input has archaea or bacteria which in that entry
– Biomage
Jul 9 at 12:27
@Biomage - I think needdf[['before_arch', 'after_arch']] = df['Percentage'].str.split('archaea',n=1, expand=True)
– jezrael
Jul 9 at 12:46
Hmm, in my actual input data I also have some 'bacteria' instead of 'archaea' is there a solution that put either result in it's own column?
– Biomage
Jul 9 at 12:48
1
@Biomage - do you think split by wordarchaea
orbacteria
? Then need.str.split('archaea|bacteria',n=1, expand=True)
– jezrael
Jul 9 at 12:49
if I try:df['Domain'] = df['Percentage'].str.split('archaea|bacteria',n=1, expand=True)
I get a wrong number of items passed error
– Biomage
Jul 9 at 12:57
|
show 5 more comments
Do you know how I could also split the string 'archaea' out of the input csv into another column? My input has archaea or bacteria which in that entry
– Biomage
Jul 9 at 12:27
@Biomage - I think needdf[['before_arch', 'after_arch']] = df['Percentage'].str.split('archaea',n=1, expand=True)
– jezrael
Jul 9 at 12:46
Hmm, in my actual input data I also have some 'bacteria' instead of 'archaea' is there a solution that put either result in it's own column?
– Biomage
Jul 9 at 12:48
1
@Biomage - do you think split by wordarchaea
orbacteria
? Then need.str.split('archaea|bacteria',n=1, expand=True)
– jezrael
Jul 9 at 12:49
if I try:df['Domain'] = df['Percentage'].str.split('archaea|bacteria',n=1, expand=True)
I get a wrong number of items passed error
– Biomage
Jul 9 at 12:57
Do you know how I could also split the string 'archaea' out of the input csv into another column? My input has archaea or bacteria which in that entry
– Biomage
Jul 9 at 12:27
Do you know how I could also split the string 'archaea' out of the input csv into another column? My input has archaea or bacteria which in that entry
– Biomage
Jul 9 at 12:27
@Biomage - I think need
df[['before_arch', 'after_arch']] = df['Percentage'].str.split('archaea',n=1, expand=True)
– jezrael
Jul 9 at 12:46
@Biomage - I think need
df[['before_arch', 'after_arch']] = df['Percentage'].str.split('archaea',n=1, expand=True)
– jezrael
Jul 9 at 12:46
Hmm, in my actual input data I also have some 'bacteria' instead of 'archaea' is there a solution that put either result in it's own column?
– Biomage
Jul 9 at 12:48
Hmm, in my actual input data I also have some 'bacteria' instead of 'archaea' is there a solution that put either result in it's own column?
– Biomage
Jul 9 at 12:48
1
1
@Biomage - do you think split by word
archaea
or bacteria
? Then need .str.split('archaea|bacteria',n=1, expand=True)
– jezrael
Jul 9 at 12:49
@Biomage - do you think split by word
archaea
or bacteria
? Then need .str.split('archaea|bacteria',n=1, expand=True)
– jezrael
Jul 9 at 12:49
if I try:
df['Domain'] = df['Percentage'].str.split('archaea|bacteria',n=1, expand=True)
I get a wrong number of items passed error– Biomage
Jul 9 at 12:57
if I try:
df['Domain'] = df['Percentage'].str.split('archaea|bacteria',n=1, expand=True)
I get a wrong number of items passed error– Biomage
Jul 9 at 12:57
|
show 5 more comments
up vote
2
down vote
You can extract the first letter directly:
df['Amino Acid'] = df['Percentage'].str[0]
df['Percentage'] = df['Percentage'].str[1:]
add a comment |
up vote
2
down vote
You can extract the first letter directly:
df['Amino Acid'] = df['Percentage'].str[0]
df['Percentage'] = df['Percentage'].str[1:]
add a comment |
up vote
2
down vote
up vote
2
down vote
You can extract the first letter directly:
df['Amino Acid'] = df['Percentage'].str[0]
df['Percentage'] = df['Percentage'].str[1:]
You can extract the first letter directly:
df['Amino Acid'] = df['Percentage'].str[0]
df['Percentage'] = df['Percentage'].str[1:]
answered Jul 9 at 10:47
jpp
85.5k194897
85.5k194897
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f51243702%2fsplitting-pandas-dataframe-column-into-two-after-the-first-letter-in-the-cell%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown