Not valid characters - how to fix?

Daniele Di Lorenzo

Hello

in some TXT files used to fed BoardI have some characters that are not recognized.

I know that I can do something in the source data but for now I jsut want to fix the issue replacing such characters.

Setting a formula in the ETL is seems working but when I run the data reader it does not work

see the picture

can someone help?

Find more posts tagged with

Accepted answers

All comments

Johannes Blischke

Hi Daniele Di Lorenzo,

you can set the encoding of your csv files directly in the datareader settings:

Try to find the encoding of your files and the characters should then display correctly (however, the prerequisite for this is that the characters are displayed correctly in your source file).

Cheers

Johannes

unknown

Hi Daniele,

I don't have time to test myself, right at the moment - but I'm more inclined to think that you'll need to use the CHAR() function, rather than (as I'm assuming you've done) copying and pasting the funny character and putting it inside double-quotes. I don't know how many different invalid characters you potentially have ..... it could become quite a large number of nested SUBSTITUTE() formulas to remove all of the possible characters that need to be stripped out

Daniele Di Lorenzo

Thank you Johannes Blischke and Craig Parris

I know that I have to work on source data readers playing with encoding setup but unfortunately I have multiple sources from txt files and it requires too much work at the moment. Moreover in the future such files have to replaced with some obdc connection to a sql db so for sure I solve these issues.

But now I was searching a workaround for characters that are already "not identified" in board and appears as <?>.

I print the tree or entity affected and re-reading it I was trying to repair the character, but without success (as explained above)

but, as far I can see, seems not possible

unknown

Ciao Daniele Di Lorenzo,

As already suggested you can control the encoding of the file in the datareader protocol itself. The ETL will inherit the encoding of the file when executing so it wont be able to execute the formula properly in case of a mismatch, like in your specific case.

When having to deal with multiple files without knowing the source encoding, the best thing you can do is batch-convert them all into UTF8 and set all your readers to UTF 8, if you are on premise.

On the cloud you can still do this via an ETL tool or copying, converting and copying back via AZCopy tool

You can run this simple powershell command on a file

Get-Content .\test.txt | Set-Content -Encoding utf8 test-utf8.txt

This will convert any text file to a utf8 encoded file, whatever source encoding they have

Hope this helps
Michele

Daniele Di Lorenzo

Ciao Michele Roscelli

thanks a lot, I'll try it

Buon 2019