
GeoffreyBernardo

Posts posted by GeoffreyBernardo

  1. I want to remove empty paragraphs from an HTML document using simple_html_dom.php. I know how to do it with the DOMDocument class, but because the HTML files I work with are prepared in MS Word, DOMDocument's loadHTMLFile() method fails with this error: "Namespaces are not defined".

     

    This is the code I use with the DOMDocument object for HTML files not prepared in MS Word:

    <?php
    /* Using the DOMDocument class */
    
    /* Create a new DOMDocument object. */
    $html = new DOMDocument("1.0", "UTF-8");
    
    /* Load HTML code from an HTML file into the DOMDocument. */
    $html->loadHTMLFile("HTML File With Empty Paragraphs.html");
    
    /* Assign all the <p> elements into the $pars DOMNodeList object. */
    $pars = $html->getElementsByTagName("p");
    
    echo "The initial number of paragraphs is " . $pars->length . ".<br />";
    
    /* The trim() function is used to remove leading and trailing spaces as well as
    * newline characters. */
    for ($i = 0; $i < $pars->length; $i++){
        if (trim($pars->item($i)->textContent) == ""){
            $pars->item($i)->parentNode->removeChild($pars->item($i));
            $i--;
        }
    }
    
    echo "The final number of paragraphs is " . $pars->length . ".<br />";
    
    // Write the HTML code back into an HTML file.
    $html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
    ?>
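
A hedged aside on the DOMDocument route: if the "Namespaces are not defined" message is a libxml parse warning triggered by Word's prefixed tags (an assumption, since the full message is not shown), it may be possible to keep using DOMDocument by collecting parse errors internally with libxml_use_internal_errors(). The sketch below uses a made-up inline string in place of a real Word export. It also treats the non-breaking space as whitespace, because trim() on its own does not strip it, and it loops over a snapshot of the node list rather than the live list.

```php
<?php
// Hypothetical stand-in for a Word-exported file: note the <o:p> tag and &nbsp;.
$source = '<html><body><p>Text</p><p><o:p>&nbsp;</o:p></p><p>   </p></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($source);

// Discard the collected errors and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Snapshot the live DOMNodeList so removals do not shift the iteration.
$paragraphs = iterator_to_array($doc->getElementsByTagName('p'));

foreach ($paragraphs as $p) {
    // Treat the non-breaking space (U+00A0) as whitespace; trim() alone keeps it.
    $text = str_replace("\u{00A0}", ' ', $p->textContent);
    if (trim($text) === '') {
        $p->parentNode->removeChild($p);
    }
}

// Number of <p> elements left after the empty ones are removed.
$remaining = $doc->getElementsByTagName('p')->length;
```

Whether this is enough for a given Word file depends on what the parser actually rejects, so it is worth inspecting libxml_get_errors() before clearing the errors.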

     

    This is the code I use with the simple_html_dom.php module for HTML files prepared in MS Word:

    <?php
    /* Using simple_html_dom.php */
    
    include("simple_html_dom.php");
    
    $html = file_get_html("HTML File With Empty Paragraphs.html");
    
    $pars = $html->find("p");
    
    for ($i = 0; $i < count($pars); $i++) {
        if (trim($pars[$i]->plaintext) == "") {
            unset($pars[$i]);
            $i--;
        }
    }
    
    $html->save("HTML File without Empty Paragraphs.html");
    ?>

     

    It is almost the same, except that the $pars variable is a DOMNodeList when using DOMDocument and an array when using simple_html_dom.php. But this code does not work. First it runs for two minutes and then reports these errors: "Undefined offset: 1" and "Trying to get property of non-object" for this line: "if (trim($pars[$i]->plaintext) == "") {".

     

    Does anyone know how I can fix this?

     

    Thank you.

     

    I also asked on Stack Overflow.
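
One likely explanation for the symptoms above, offered as a sketch rather than a confirmed diagnosis: unset() on a PHP array removes the key without renumbering the others, so after the first removal the loop keeps landing on the same missing index while count() never drops below it. unset() also only changes the array returned by find(), not the parsed document. The example below uses plain strings (hypothetical data) in place of each paragraph's plaintext, so it runs without simple_html_dom.php.

```php
<?php
// Plain strings stand in for the plaintext of each <p> element (hypothetical data).
$texts = ["First", "  ", "Second", "\n", ""];

// unset() removes the key but does not renumber the remaining keys.
$copy = $texts;
unset($copy[1]);
$keysAfterUnset = array_keys($copy);   // key 1 is now missing
$countAfterUnset = count($copy);       // one fewer element, but the gap remains

// foreach visits only the keys that exist, so no index bookkeeping is needed.
$emptyKeys = [];
foreach ($texts as $key => $text) {
    if (trim($text) === "") {
        $emptyKeys[] = $key;
    }
}
```

With simple_html_dom itself, the approach usually described for removing an element is to assign an empty string to its outertext property inside such a foreach loop. That part is left out of the sketch because it depends on the library being present.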

     
