How can I remove websites' links from a text?

Question

Dario Borrelli on 1 Feb 2017

Answered: Christopher Creutzig on 2 Nov 2017

I am trying to remove websites' links from a string. I would like to remove (or replace with a space ' ') every link that starts with 'https:'. I tried using the command regexprep, but I am able to replace only a specific link.

1 Comment

Jan on 1 Feb 2017

Please post some relevant part of the text. Is the "https:" included in < and > or in double quotes? Can spaces appear in the links?

Answer 1

Iddo Weiner on 1 Feb 2017

Edited: Iddo Weiner on 1 Feb 2017

Dario, this really depends on what your data looks like, but I made an assumption about what your text might look like. Please check out the following method:

text = 'some words https:link some other words https:otherlink final words';
disp(text)

some words https:link some other words https:otherlink final words

text_copy = text; % work on a copy so you always have the original for comparison
base_string = 'https:';
first_del_idx = strfind(text, base_string); % this is where each link string starts
% find the paired last index for each first index
last_del_idx = nan(size(first_del_idx));
for i = length(first_del_idx):-1:1 % the loop works "backwards"
    next_idx = first_del_idx(i) + length(base_string); % no point in checking before this point
    while true
        % stop at a space, at a backslash (a link at the end of a line), or at the end of the text
        if next_idx >= length(text_copy) || text_copy(next_idx) == ' ' || text_copy(next_idx) == '\'
            last_del_idx(i) = min(next_idx, length(text_copy)); % never index past the end
            text_copy(first_del_idx(i) : last_del_idx(i)) = []; % this is the actual deletion
            break % out of the while loop
        end
        next_idx = next_idx + 1;
    end
end
% let's see what we're left with
disp(text_copy)

some words some other words final words

Explanation: You might need to adjust a few things in the code, so here's the logic. I assumed you have a base string that can be used to find all link occurrences. I also assumed that links are written without spaces and that a space marks the end of a link, so if you start at "https:" and stop when you bump into a space (' '), you have found the full length of the substring to delete. If this is not the situation, you will need a different identifier for the end of a link, maybe '.com' or '/'; I can't know this for sure without seeing your data. There is at least one edge case I could think of that could create bugs in my code: what if the link is at the end of a row? In that case it would not end with a space. If your text stores line breaks as the literal two characters \n, the link would be followed by a backslash '\', so I added a condition to protect against this. Then again, your data may not have a literal \n at the end of lines (a real newline is the single character char(10), for example), and then we'd have to think of a different identifier for these cases.
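To make the end-of-line case concrete, here is a small sketch of the two ways a line break can show up in a MATLAB char array. The variable names (literal, real_nl and so on) are made up for illustration, and which case applies depends on how the text was read in, which the question does not say.

```matlab
% Single-quoted text keeps \n as two characters: a backslash followed by 'n'.
literal = 'see https:link\nnext line';
% sprintf converts \n into one real newline character, char(10).
real_nl = sprintf('see https:link\nnext line');

has_backslash = any(literal == '\');       % a test for '\' finds the end of the link here
has_newline   = any(real_nl == char(10));  % here the link is followed by a real newline instead
nl_is_space   = isspace(char(10));         % isspace treats a newline as whitespace
```

If the text holds real newlines, a test based on isspace would catch spaces, tabs and line breaks with one condition, whereas a comparison against '\' would never trigger.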

There are some principles I highlighted here that might be a little confusing. Working on a copy (and not on the original data) is good coding practice. I'd also recommend traversing the string backwards, so that erasing a link does not shift the indices of the links you have not processed yet, which can cause all kinds of unwanted bugs.

I hope this helps

p.s. I worked here with strfind(), but you could substitute it with regular expression based functions, such as regexp() if you prefer. It's essentially the same in this case.
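As a rough sketch of the regular-expression route, assuming (as above) that a link runs from 'https:' up to the next whitespace character, regexprep can do the removal in a single call. The variable name cleaned is just for illustration, and the pattern would need adjusting if your links are delimited differently.

```matlab
text = 'some words https:link some other words https:otherlink final words';
% 'https:' followed by any run of non-whitespace characters (\S*),
% plus at most one trailing whitespace character (\s?) so no double space is left behind
cleaned = regexprep(text, 'https:\S*\s?', '');
disp(cleaned)
```

To replace each link with a space instead of removing it, the pattern 'https:\S*' with the replacement ' ' should work under the same assumption.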