
Parsing a document into a particular format.


mattal999


Hi guys,

 

I am trying to make a simple document parser to make it easy for me to import tests into a new learning system I am creating. I have a document, like so (saved in Plain Text format so that PHP can make sense of it):

 

1. Which of the following descriptions best describes the process of active transport?

(a) It is a physical process that happens inside cells
(b) It is a type of osmosis that occurs mainly in plant cells
(c) It is way that dissolved substances can move that uses energy
(d) It is the way that cells make energy and transport water and starch
(e) It is the process that enables water to move up xylem vessels and phloem tubes

2. Which of the following blood vessels would be expected to have the largest lumen?

(a) Vein
(b) Capillary
(c) Venule
(d) Arteriole
(e) Artery

3. Which of the following microorganisms respires anaerobically to make bread?

(a) Yeast
(b) E. coli
(c) Campylobacter cinaedi
(d) Fusarium
(e) Penicillim notatum

4. Which of the following organisms uses spiracles to breathe?

(a) Tadpoles
(b) Frogs
(c) Fish
(d) Mice
(e) Locusts

5. Which of the following feature is typical of red blood cells?

(a) They can produce antibodies
(b) They are biconcave discs
(c) They perform active transport
(d) They regulate mutate in to other cells types
(e) The ‘disintegrate’ on contact with the air




6. Which is the preferred food source for bacteria during yogurt manufacturing?

(a) Starch
(b) Glycogen
(c) Rennin
(d) Lactose
(e) Amylase

7. Which structures inside the small intestines enable the absorption of digested food?

(a) Villi
(b) Valves
(c) Alveoli
(d) Cardiac muscles
(e) Mitochondria

8. What type of substance is glycogen?

(a) A protein
(b) A fat
(c) A carbohydrate
(d) An amino acid
(e) A urobilin

9. In which type of vessel do we grow microorganisms in a large industrial scale?

(a) Fermenters
(b) Gas syringes
(c) Potometers
(d) Respirometers
(e) Liebig condensers

10. Through which blood vessel does blood enter the heart?

(a) Aorta
(b) Vena cava
(c) Pulmonary artery
(d) Pulmonary vein
(e) Hepatic portal vein

11. Anaerobic respiration results in which two products?

(a) Glucose and energy
(b) Oxygen and energy
(c) Glycgen and Oxygen
(d) Lactic acid and energy
(e) Lactic acid and glucose
12. Mycoprotein can be converted in to many products with many flavours. Which of the following is a typical example of the use of mycoprotein in the food industry?

(a) Adding flavour to yogurts
(b) As a constituent of tomato ketchup
(c) As a cheaper alternative to salad dressings
(d) As a substitute for meat
(e) As a low fat option in mayonnaise

13. Which piece of equipment should you choose if you are to perform an investigation in to the rate of transpiration?

(a) Colorimeter
(b) Potometer
(c) Lux meter
(d) UV lamp
(e) Micro pipette

14. Which blood vessel carries blood away from the kidney?

(a) Renal artery
(b) Hepatic portal vein
(c) Hepatic artery
(d) Renal vein
(e) Ureter

15. Which are the main constituents of biogas?

(a) Nitrogen and methane
(b) Methane and sulphur dioxide
(c) Methane and carbon dioxide
(d) Carbon dioxide and oxygen
(e) Oxygen and carbon dioxide

16. Which types of drug are used after an organ has been received from a donor, in order to prevent the organ from being rejected?

(a) Antibiotics
(b) Anti virals
(c) Immunosupressants
(d) Beta blockers
(e) Inhalers







17. Dialysis machines save lives every day and replace the function of the human kidney. Which substance is completely removed by this treatment?

(a) Glucose
(b) Water
(c) Blood
(d) Sodium
(e) Urea

18. Which 3 environmental conditions would increase the rate of transpiration the most?

(a) Hot, dry, windy
(b) Cold, wet, calm
(c) Cold, dry, windy
(d) Humid, wet, hot
(e) Humid, cloudy, hot





 

As you can see, the number of blank lines between questions and answers varies from place to place, which makes it hard to parse the file correctly. What I am currently doing is opening the file, exploding it on two newlines and parsing each piece into the correct format. To give you an idea of the output I want, see below:

 

1;Which of the following descriptions best describes a hormone?
Electrical signals given out by the body in response to your surroundings.
Chemical and electrical signals made by the brain, which usually act quite quickly.
Chemical messengers made by glands inside the body, which usually act quite slowly.
Feelings that are caused by changes during pregnancy.
Messages that travel along neurones to receptor cells.
A:3;T:Nerves and Communication

2;Why can a cactus be described as a succulent?
It is covered by spikes for protection.
It stores water in its stem, leaves and roots.
It requires very high temperatures.
It rarely flowers.
It is unlikely to be eaten by animals.
A:2;T:Ecology and Adaptations

3;Which of the following conditions does your body not control as part of homeostasis?
Body temperature.
Water and ion balance.
Sugar intake.
Blood sugar levels.
Blood pressure.
A:3;T:Nerves and Communication

4;Which of the following statements is a description of metabolic rate?
The rate at which the chemical reactions inside the cells of your body take place.
The rate at which your heart rate increases due to exercise.
The rate at which your heart rate returns to normal after exercise.
The rate at which you can digest food.
The speed at which your body breaks down harmful chemicals in to useful products.
A:1;T:Health

5;Which of the following properties is not a benefit of eating fat as part of a balanced diet?
It helps you to insulate parts of your body.
It helps you to protect body organs.
It can be used as an energy source.
It helps you to digest other proteins and carbohydrates.
It helps you to digest certain vitamins.
A:4;T:Health

6;Which of the following drugs are recreational drugs? Choose one.
Ecstasy.
Antibiotic.
Solute.
Precipitate.
Nicotine.
A:5;T:Drugs

7;What type of drug is alcohol?
Stimulant.
Barbiturate.
Sedative.
Hallucinogen.
Illegal.
A:3;T:Drugs

8;Which of the following hormones is responsible for building up a thick lining of the female womb during a menstrual cycle?
Follicle stimulating hormone (FSH).
Oestrogen.
Luteinizing hormone (LH).
Progesterone.
Anti-diuretic hormone (ADH).
A:2;T:Nerves and Communication

9;Which chemical associated with smoking, causes your body to reduce the amount of oxygen carried by your blood?
Nicotine.
Tar.
Tobacco.
Carcinogen.
Carbon monoxide.
A:5;T:Drugs

10;Which description best describes the body's first line of defence?
The white blood cells producing antibodies.
The red blood cells carrying oxygen to all body cells.
Platelets disintegrating as they meet the air, forming a scab.
Skin, mucus and platelets acting together to prevent microbes entering the body.
Skin, mucus and white blood cells preventing microbes from causing harm inside the body.
A:4;T:Immune System

11;Who discovered Penicillin?
Michael Faraday.
Edward Jenner.
Louis Pasteur.
Alexander Flemming.
Gregor Mendel.
A:4;T:Immune System

12;Which of the following answers lists the separate parts of a reflex arc in the correct order?
Co-ordinator - Receptor - Stimulus - Response - Effector.
Controller - Receptor - Stimulus - Effector - Response.
Stimulus - Co-ordinator - Effector - Response.
Stimulus - Receptor - Co-ordinator - Effector - Response.
Effector - Communicator - Receptor - Co-ordinator - Response.
A:4;T:Nerves and Communication

13;Which type of pathogen causes influenza?
Virus.
Bacteria.
Fungi.
Protozoa.
Prion.
A:1;T:Immune System

14;How does an antibiotic work?
It damages the cell walls of bacteria so that they can no longer multiply inside us.
It damages the cell walls of viruses so that they can no longer multiply inside us.
It helps white blood cells produce antibodies that fight infection.
It stops us spreading the disease to other people.
It prevents the pathogen entering the blood system.
A:1;T:Immune System

15;How do plants lose water?
Through pores in the leaves called stomata.
Through pits in the leaves.
Through the roots.
Through the roots and stem.
Through xylem cells and phloem tubes.
A:1;T:Ecology and Adaptations

16;Which condition are you likely to suffer from as a result of eating too little vitamin C?
Rickets.
Cancer.
Heart disease.
Pleurisy.
Scurvy.
A:5;T:Health

17;Competition is part of life for all living things. What do plants compete for?
Starch.
Soil.
Seeds.
Sunlight.
Food.
A:5;T:Ecology and Adaptations

18;Which type of reproduction causes the most variation?
Sexual reproduction - as genetic information is mixed from two parents.
Asexual reproduction - as this is like cloning and means that the parents can dictate the variation that occurs.
Sexual reproduction - as there are more animal species that plant species.
Asexual reproduction - as plants can reproduce quicker than animals.
Cloning - as the genetic information from animals or plants can now be recreated by humans.
A:1;T:Genes and Variation

19;Animals compete for many things. Which list best summarises the things that animals compete for?
Territory - Sunlight - Food - Mates.
Territory - Water - Mates - Food.
Water - Plants - Meat - Habitat.
Meat - Space - Sunlight - Water.
Habitat - Food - Soil - Mates.
A:2;T:Ecology and Adaptations

20;Starting with the largest, arrange the following list of biological molecules by size, down to the smallest.
Gene - Chromosome - DNA - Cell - Nucleus.
DNA - Cell - Gene - Nucleus - Chromosome.
Cell - Nucleus - DNA - Chromosome - Gene.
Cell - Chromosome - DNA - Nucleus - Gene.
Gene - Nucleus - Cell - Chromosome - DNA.
A:3;T:Genes and Variation

21;Which of the following is not a benefit of genetic engineering?
To improve the quality of food for humans.
To make plants more resistant to certain environmental conditions.
To change the genetic make-up of organisms threatened by extinction.
To improve and advance medical treatments against disease.
To make life saving proteins in the milk of farm animals.
A:3;T:Genes and Variation

22;Which of the following is an example of cloning?
In-vitro fertilisation (IVF).
Growing plants from seeds.
Being a surrogate mother.
Growing a plant by taking a cutting.
Artificial insemination of farm animals.
A:4;T:Genes and Variation

23;What is cholesterol?
A harmful substance that can be eaten and causes you to put on weight.
A type of polyunsaturated fat that is carried around in your blood.
A substance made by the liver that helps your cell membranes to work properly.
A white solid that is added to food to make is taste good, but is harmful if eaten in large amounts.
A type of oil that we need to help our heart to beat normally.
A:3;T:Health

24;What is vaccination?
It is the cure for a viral infection.
It is a way to prevent microbes from entering the body and then multiplying.
It is an altered version of a pathogen that helps you to produce antibodies against a disease.
It is a treatment for measles, mumps and rubella.
It is a man made chemical that can be injected in to the body to treat disease.
A:3;T:Immune System

25;Which scientist suggested that animals evolve by passing on useful changes developed by parents during their lives?
Charles Darwin.
Gregor Mendel.
Jean-Baptiste Lamarck.
Edwin Hubble.
Richard Dawkins.
A:3;T:Evolution

26;How is acid rain formed in the atmosphere?
Hydrochloric acid made by industry evaporates in to the air and eventually forms as acid rain.
Carbon dioxide dissolves in the air and this is then turned in to acid rain.
Sulphur dioxide and nitrogen oxides dissolve in rain and react with oxygen to make rain acidic.
Burning fossil fuels.
Pesticides added to soil enter the rivers and streams. This evaporates and forms acid rain.
A:3;T:Planet Earth

27;Which description best describes the process of natural selection?
It is the process that causes mutation and therefore evolution.
It is the way that animals change over millions of years.
Offspring inherit mutant genes that help them to reproduce better than other organisms.
It is the process that causes adaptations.
Offspring with genes that are best suited for survival get passed on as these individuals can survive and breed.
A:5;T:Evolution

28;What is a brown field site?
It is a site that is usually in towns or cities and can be redeveloped.
It is a site that is protected by law and is likely tocontain valuable plant life.
It is a site that has never been built on before.
It is a site set aside by governments to produce organic food.
It is a site free from pesticides, fungicides and insecticides.
A:1;T:Planet Earth

29;What is the best definition of extinction?
It is the permanent loss of all members of a species from planet Earth.
Is is a temporary loss of all members of a species from planet Earth.
It is an event 65 million years ago that caused the loss of the dinosaurs.
Extinction is a natural process usually caused by disease.
Extinction is an un-natural process caused by meteorite collisions.
A:1;T:Evolution

30;Which 2 gases can be considered as greenhouse gases?
Carbon Dioxide and Nitrogen.
Methane and Carbon Dioxide.
Nitrogen and Carbon Dioxide.
Sulphur Dioxide and Oxygen.
Hydrogen and Water Vapour.
A:2;T:Planet Earth

 

This was written by hand, and is added to my database using the following script:

 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> 

<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<title>i-School.me</title>
<link rel="stylesheet" type="text/css" href="../css/style.css" />
<style>
body {
margin: 20px;
}
</style>
</head>

<body><?php

include("../includes/connect.php");

//PHP 4.2.x Compatibility function
if (!function_exists('file_get_contents')) {
      function file_get_contents($filename, $incpath = false, $resource_context = null)
      {
          if (false === $fh = fopen($filename, 'rb', $incpath)) {
              trigger_error('file_get_contents() failed to open stream: No such file or directory', E_USER_WARNING);
              return false;
          }
  
          clearstatcache();
          if ($fsize = @filesize($filename)) {
              $data = fread($fh, $fsize);
          } else {
              $data = '';
              while (!feof($fh)) {
                  $data .= fread($fh, 8192);
              }
          }
  
          fclose($fh);
          return $data;
      }
  }

// The paper name ends up in a file name and a table name, so only accept
// letters, digits and underscores.
$paper = isset($_GET['paper']) ? $_GET['paper'] : '';
if (preg_match('/^[A-Za-z0-9_]+$/', $paper)) {
    $file = file_get_contents($paper.'.txt', true);
    if ($file === false) {
        die("Could not read ".$paper.".txt");
    }
    // Normalise line endings, then split into blocks on one or more blank lines.
    // (A literal newline inside quotes only matches this script's own line endings.)
    $file = str_replace(array("\r\n", "\r"), "\n", $file);
    $file = preg_split('/\n\s*\n/', trim($file));

    $query = mysql_query("CREATE TABLE ".$paper."test (
      id smallint(100) NOT NULL auto_increment,
      questionno smallint(100) NOT NULL default '0',
      question varchar(200) NOT NULL default '',
      answer1 varchar(200) NOT NULL default '',
      answer2 varchar(200) NOT NULL default '',
      answer3 varchar(200) NOT NULL default '',
      answer4 varchar(200) NOT NULL default '',
      answer5 varchar(200) NOT NULL default '',
      correct varchar(100) NOT NULL default '',
      topic varchar(100) NOT NULL default '',
      PRIMARY KEY  (id)
    )") or die(mysql_error());

    foreach ($file as $questionstring) {
        // Each block must be 7 lines: "no;question", five answers, "A:n;T:topic".
        $lines = explode("\n", trim($questionstring));
        $line1 = explode(";", $lines[0], 2);
        $line6 = isset($lines[6]) ? explode(";", $lines[6], 2) : array();
        if (count($lines) != 7 || count($line1) != 2 || count($line6) != 2) {
            echo "Skipped a malformed block starting: ".htmlspecialchars($lines[0])."<br /><br />";
            continue;
        }
        $questionno = (int) $line1[0];
        $correct = str_replace("A:", "", $line6[0]);
        $topic = str_replace("T:", "", $line6[1]);

        // Show the raw values, HTML-escaped, before they are escaped for SQL.
        echo "<b>".$questionno.". ".htmlspecialchars($line1[1])."</b><br />";
        for ($i = 1; $i <= 5; $i++) {
            echo "[ ] ".htmlspecialchars($lines[$i])."<br />";
        }
        echo "<br /><b>Correct: </b>".htmlspecialchars($correct)." - <b>Topic: </b>".htmlspecialchars($topic)."<br /><br />";

        // Escape for SQL using the connection opened in connect.php (needs PHP 4.3+).
        $values = array_map('mysql_real_escape_string', array($line1[1], $lines[1], $lines[2], $lines[3], $lines[4], $lines[5], $correct, $topic));
        $query = mysql_query("INSERT INTO ".$paper."test VALUES (NULL, '".$questionno."', '".implode("', '", $values)."')") or die("Could not insert question #".$questionno." because: ".mysql_error());
    }
}

?>
</body>
</html>

 

Now, the script above does not parse the raw document correctly because the number of newlines varies, so the values come out wrong (blocks split in the wrong places, questions with no answers, answers treated as questions, etc). I need help figuring out how to parse these files and format them like the correct version I posted above. Also, I have answer documents and related topic documents, which need to be cross-referenced with each question to make the final line for each question - "A:2(This is the answer);T:Nerves and Communication(This is the topic)". They are separate documents, all of which are attached.

 

Thanks, Luke.

 

[attachment deleted by admin]
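
For the cross-referencing step described above, a minimal sketch of building one block in the target format might look like the following. The attachments are no longer available, so the layout of the answer and topic documents is an assumption here: the sketch simply takes an answer letter (a to e) and a topic string for each question, however they were read. The function name `format_question` and the array keys are made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): builds one block in the
// target import format.
// $question is array('no' => int, 'question' => string, 'answers' => array of strings).
// ASSUMPTION: $answerLetter is a letter a-e and $topic is plain text; the real
// answer and topic documents may be laid out differently.
function format_question($question, $answerLetter, $topic)
{
    // Convert 'a'..'e' to 1..5, as used by the "A:" field.
    $answerNo = ord(strtolower(trim($answerLetter))) - ord('a') + 1;

    $lines = array($question['no'] . ';' . $question['question']);
    foreach ($question['answers'] as $answer) {
        $lines[] = $answer;
    }
    $lines[] = 'A:' . $answerNo . ';T:' . trim($topic);

    return implode("\n", $lines);
}

$question = array(
    'no' => 8,
    'question' => 'What type of substance is glycogen?',
    'answers' => array('A protein', 'A fat', 'A carbohydrate', 'An amino acid', 'A urobilin'),
);
// The letter and topic below are placeholders for values read from the other two files.
$block = format_question($question, 'c', 'Example Topic');
echo $block, "\n";
```

Blocks built this way can be joined with a blank line between them to produce a file in the same shape as the hand-made example.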


you'd hafta learn regular expressions lol.. writing it for you would be of little beneficial for you, uhm, try exploding by new lines, thats another way..

 

explode("\n",$theQuestionsAndAnswersAndAllDatWhiteSpace);

 

den loop thru dat and check for data in da array elements.. if it has data throw it into an array.. every 5 elements start a new array..
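
The line-by-line idea above can be sketched roughly like this. It is only a sketch under assumptions: the function name `group_questions` is made up for illustration, and it assumes every question is one question line followed by exactly five answers, so a new group starts every six non-empty lines rather than every five.

```php
<?php
// Hypothetical helper (not from the original post): splits raw quiz text into
// questions by ignoring blank lines entirely and grouping what is left.
// ASSUMPTION: each question is one question line plus exactly five answer lines.
function group_questions($text)
{
    $questions = array();
    $current = array();
    foreach (explode("\n", $text) as $line) {
        $line = trim($line);          // also strips the "\r" of Windows line endings
        if ($line === '') {
            continue;                 // however many blank lines there are, skip them
        }
        $current[] = $line;
        if (count($current) === 6) {  // 1 question + 5 answers
            $questions[] = $current;
            $current = array();
        }
    }
    return $questions;
}

$sample = "1. First question?\n\n(a) One\n(b) Two\n(c) Three\n(d) Four\n(e) Five\n"
        . "2. Second question?\n\n\n\n(a) Six\n(b) Seven\n(c) Eight\n(d) Nine\n(e) Ten\n";
$groups = group_questions($sample);
echo count($groups), "\n";
echo $groups[1][0], "\n";
```

Counting lines like this breaks as soon as one question has a missing or extra answer, so it is worth checking the result before inserting anything into a database.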



If you ever were to buy a script, don't buy it from someone who can't even muster up the integrity to spell the English language correctly. It would reflect in the finished product, trust me.

 

"Dat's all, folks!"


If you ever were to buy a script, don't buy it from someone who can't even muster up the integrity to spell the English language correctly. It would reflect in the finished product, trust me.

 

QFT.

 

I am trying to make a simple document parser to make it easy for me to import tests into a new learning system I am creating. I have a document, like so (saved in Plain Text format so that PHP can make sense of it):

 

See, that's actually where your problem begins.  PHP (or anything else, for that matter) is not very good at "making sense" of plain text.  If you are able to save it with tags or delimiters, then your problem can be solved very easily.
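
If re-saving the document is not an option, the numbering ("12.") and the answer letters ("(a)" to "(e)") in the posted sample can arguably serve as delimiters themselves. Here is a rough sketch using regular expressions; the function name `parse_quiz` and the array keys are made up for illustration, and it assumes every question starts a line with a number and a full stop and every answer starts a line with a bracketed letter, as in the sample.

```php
<?php
// Hypothetical helper (not from the original post): treats "12. " at the start
// of a line as the start of a question and "(a) ".."(e) " as the start of an
// answer. Blank lines, however many there are, are simply ignored.
function parse_quiz($text)
{
    $questions = array();
    $index = -1;
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $line = trim($line);
        if (preg_match('/^(\d+)\.\s+(.+)$/', $line, $m)) {
            // A new question begins here.
            $index++;
            $questions[$index] = array(
                'no' => (int) $m[1],
                'question' => $m[2],
                'answers' => array(),
            );
        } elseif ($index >= 0 && preg_match('/^\(([a-e])\)\s+(.+)$/', $line, $m)) {
            // An answer belonging to the most recent question.
            $questions[$index]['answers'][] = $m[2];
        }
    }
    return $questions;
}

// Shortened sample: no blank line before question 4, several after it.
$raw = "3. Which of the following microorganisms respires anaerobically to make bread?\n\n"
     . "(a) Yeast\n(b) E. coli\n"
     . "4. Which of the following organisms uses spiracles to breathe?\n\n\n\n"
     . "(a) Tadpoles\n(b) Frogs\n";
$parsed = parse_quiz($raw);
echo count($parsed), "\n";
echo $parsed[0]['answers'][1], "\n";
```

Because each line is classified by its own marker, a question that directly follows the previous answers (as question 12 does in the sample) is still picked up correctly.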
