Not valid characters - how to fix?

Hello

 

in some TXT files used to fed BoardI have some characters that are not recognized.

I know that I can do something in the source data but for now I jsut want to fix the issue replacing such characters.

 

Setting a formula in the ETL is seems working but when I run the data reader it does not work

see the picture

can someone help?

Answers

  • Johannes Blischke
    edited March 2020

    Hi Daniele Di Lorenzo,

     

    you can set the encoding of your csv files directly in the datareader settings:

    image

     

    Try to find the encoding of your files and the characters should then display correctly (however, the prerequisite for this is that the characters are displayed correctly in your source file).

     

    Cheers

    Johannes

  • Unknown
    Unknown Active Partner

    Hi Daniele,

     

    I don't have time to test myself, right at the moment - but I'm more inclined to think that you'll need to use the CHAR() function, rather than (as I'm assuming you've done) copying and pasting the funny character and putting it inside double-quotes.  I don't know how many different invalid characters you potentially have ..... it could become quite a large number of nested SUBSTITUTE() formulas to remove all of the possible characters that need to be stripped out

  • Thank you Johannes Blischke and Craig Parris

     

    I know that I have to work on source data readers playing with encoding setup but unfortunately I have multiple sources from txt files and it requires too much work at the moment. Moreover in the future such files have to replaced with some obdc connection to a sql db so for sure I solve these issues.

     

    But now I was searching a workaround for characters that are already "not identified" in board and appears as <?>.

    I print the tree or entity affected and re-reading it I was trying to repair the character, but without success (as explained above)

     

    but, as far I can see, seems not possible 

  • Ciao Daniele Di Lorenzo,

     

    As already suggested you can control the encoding of the file in the datareader protocol itself. The ETL will inherit the encoding of the file when executing so it wont be able to execute the formula properly in case of a mismatch, like in your specific case.

    When having to deal with multiple files without knowing the source encoding, the best thing you can do is batch-convert them all into UTF8 and set all your readers to UTF 8, if you are on premise.

    On the cloud you can still do this via an ETL tool or copying, converting and copying back via AZCopy tool

     

    You can run this simple powershell command on a file

    Get-Content .\test.txt | Set-Content -Encoding utf8 test-utf8.txt

    This will convert any text file to a utf8 encoded file, whatever source encoding they have

     

    Hope this helps 
    Michele

  • Ciao Michele Roscelli

     

    thanks a lot, I'll try it

     

    Buon 2019